Records, Computers and the Rights of Citizens (1973) V. Statistical-Reporting and Research Uses of Administrative Data Systems

. . . . in an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.

Herbert Simon, "Designing Organizations for an Information-Rich World." In Martin Greenberger (Ed.), Computers, Communication, and the Public Interest (Baltimore, Md.: The Johns Hopkins Press), 1971, pp. 40-41.

V. Statistical-Reporting and Research Uses of Administrative Data Systems

Many automated personal data systems established primarily for administrative purposes are also used for statistical reporting and research. Since one advantage of computerizing administrative records is the capability thereby acquired for high-speed data retrieval and manipulation, a growing number of administrative data systems will be put to such additional uses. The safeguard recommendations in this chapter take account of that expectation.

Dimensions of the Problem

A modern organization, as a rule, maintains elaborate records about the money it spends, the people it serves, the quantities of goods and services it dispenses, and the number, qualifications, and salaries of the people who work for it. It does so, in part, because it must account for its activities to investors or taxpayers, and to other organizations that monitor and regulate its behavior.

An organization also needs to plan for the future. A firm selling to the public is interested in knowing what the public wants, or can be persuaded to want. A school needs to know about the financial and intellectual capabilities of students coming to it for learning. A government agency tries to forecast demand for the services it provides or supports.

These incentives to develop indicators of institutional performance make it difficult to control the quantity and variety of personal data stored in administrative record-keeping systems, and the statistical-reporting and research uses that are made of such data. The personal data that organizations collect for administrative purposes should be limited, ideally, to data that are demonstrably relevant to decision making about individuals. A substantial amount of personal data, however, appear to be collected because at some point someone thought they might be "useful to have," and found they could be easily and cheaply obtained on an application form, or some other record of an administrative transaction.

For example, college students applying for government-guaranteed loans in one State have been required to provide the State guarantee agency with data on matters that had no direct relation to its individual entitlement decisions. These data, "for our statistical interest" as their intended use was described to the Committee, included race, marital status, sex, adjusted family income, and student-reported "average grades received for last term of full-time post-high school study." These data have been used to produce statistical reports for internal agency use, for informal discussions with State legislators, and to "run a profile once yearly on . . . schools and . . . lenders to see if there is any odd pattern . . . occurring." On one occasion, data in the system were also used in a study conducted by an outside researcher. For making entitlement decisions, however, the data collected in excess of those required by law were described to us as not very helpful to the program, and at least two data elements -- sex and student-reported grades -- were said to be absolutely valueless.¹

The student loan case is but one illustration. The presentations of system managers and users yielded others. We found that decisions to collect personal data are being made without careful consideration of whether they will in fact serve the purposes for which they are supposedly being collected. As a result, substantial sums may be spent on comprehensive data collections for purposes that could often be much better served by other approaches, such as collecting statistical-reporting and research data only from a small sample of an organization's clients or beneficiaries. Most disturbing of all, we found that personal data in excess of those clearly needed for making decisions about individuals are sometimes collected in a way that makes them seem prerequisite to the granting of rights, benefits, or opportunities.
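The sampling alternative mentioned above can be made concrete with a minimal sketch. The client population of 50,000, the sample size of 1,000, and the function names are all hypothetical, chosen only to illustrate the arithmetic; nothing here describes a procedure used by any organization that appeared before the Committee.

```python
import math
import random


def draw_sample(client_ids, sample_size, seed=0):
    """Draw a simple random sample of client IDs, without replacement."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return rng.sample(list(client_ids), sample_size)


def margin_of_error(sample_size, population_size, proportion=0.5, z=1.96):
    """Approximate 95 percent margin of error for an estimated proportion.

    Uses the normal approximation with a finite population correction.
    proportion=0.5 is the most conservative (widest) assumption.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    correction = math.sqrt(
        (population_size - sample_size) / (population_size - 1)
    )
    return z * standard_error * correction


# Hypothetical agency with 50,000 clients; only 1,000 are asked the
# supplementary statistical questions.
client_ids = range(1, 50001)
sample = draw_sample(client_ids, 1000)
moe = margin_of_error(sample_size=1000, population_size=50000)
```

Under these assumptions, a sample of 1,000 clients estimates a proportion to within roughly three percentage points, while the supplementary questions are put to only 2 percent of the clients.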

Mandatory or Voluntary Data Collection?

Poorly conceived data collection can result in various kinds of injury to individuals. As observed earlier, any file of personal data is a potential source of harm to individuals when it is used outside its appropriate context, and much of the personal data in administrative files either is a public record or is vulnerable to legal process.

There is also reason to believe that failure to separate information collected for statistical-reporting or research from data used in entitlement decisions may cause such decisions to be made unfairly. "Race" and "sex" are no longer asked on many application forms because of their acknowledged influence on some types of decision making about individuals. There are circumstances in which other kinds of data may have similarly unwarranted effects.² Moreover, collecting more information than is needed for day-to-day administrative decisions may discourage people from taking advantage of the services an organization offers. As one witness told the Committee:

. . . . our experience indicates that . . . . rigid adherence to proper data collection often "turns off" many clients, even when the interviewer is ingenious at gathering it. Also counselors often openly resent [having to ask] questions which actually may jeopardize their relationship with a client.

Perhaps most important of all is the intrusive effect of unrestrained data collection on self-esteem. Occasionally one hears that a wealthy citizen has hired a chauffeur and limousine to avoid disclosing his Social Security number, or some other item of information, to a State Department of Motor Vehicles. One is tempted to dismiss such protests as the trivial antics of rich eccentrics; yet they indicate the high cost of trying to escape personal inquiries of organizations that monopolize the distribution of certain privileges and benefits. The plight of the welfare beneficiary is especially extreme in this respect, but with all the forms that every one of us is constantly filling out, it would probably be hard to find a single individual who has not had at least one occasion to wonder, "Why do they want to know that?" and "What will happen if I refuse to tell them?"

Collecting statistical-reporting and research data in conjunction with the administration of service and payment programs is not intrinsically undesirable. However, such supplementary data gathering should be carefully designed and managed, and should be performed only with the voluntary, informed cooperation of individual respondents. Otherwise only personal data directly and demonstrably germane to a decision about any given individual should be collected.

Separate collection of data for statistical reporting and research could have several practical advantages. First, by increasing the cost of supplementary data gathering, it would discourage the collection of useless items. Second, it might reduce the amount of data that must be specially protected because it is identifiable. Although personal data maintained exclusively for statistical reporting and research often need broader and stronger protection than they are afforded,³ differentiating sharply among the purposes and uses of data files should encourage public confidence in organizational record-keeping practices and ease the access control burden that now weighs heavily on some system managers.

Third, separate collection of personal data for statistical reporting and research could help to make the collection process more reliable. We learned of instances in which an ambitious information system's appetite for data has induced careless statistical reporting. This problem appears to be especially prevalent where an information system has been established to help coordinate the activities of a number of small, loosely knit organizations. Such carelessness can frustrate the management objectives of a system by diluting the quality of data furnished to it in ways that may not be recognized or, if recognized, may be very difficult to control.⁴

Assuring Sound Secondary Uses of Administrative Data Systems

Administrative record-keeping operations can and do constitute rich sources of statistical-reporting and research data useful for many purposes. For example, the Federal government uses Internal Revenue Service records as a source of data for the quinquennial Census of Business and Manufacturers; hospital records are used to develop research data banks on particular diseases or disabilities; school and college records are used to study the relationship between academic performance and subsequent career achievement. Unfortunately, however, the mere existence of an administrative data base can create a strong temptation to use it for statistical reporting and research without sufficient attention to the appropriateness of doing so.

Three conditions that encourage sound use of data systems for statistical reporting and research are often absent from the environment in which administrative systems are designed and operated. They are:

knowledge of the social processes by which data come to be collected;
management of data collection and analysis by individuals with strong statistical and research competence; and
independent expert scrutiny of analytic methods and results.

Knowledge of Data Collection Processes. Detailed understanding of how and why data come to be collected is often difficult, if not impossible, to achieve. For example, not everyone who is eligible for public assistance applies for it, and the amount and kind of information collected from each applicant may vary in subtle ways.⁵ Hence, if data from administrative systems are used for statistical reporting and research, the results must take account of systematic bias resulting from incompleteness in the data base. Measuring such bias can be expensive and time-consuming, and corrections for it can be even harder to make. Highly trained people are needed to conduct careful studies of the processes by which data in a system are being generated. Because of their expense and difficulty, however, and also because they can bring to light inadequacies in the overall performance of an organization, such studies tend not to be done.
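The systematic bias described above can be written out as a short worked identity. The symbols and the numbers below are illustrative assumptions, not figures from any system that was examined. Let $\mu$ be the true average of some characteristic among all eligible persons, $p$ the fraction of eligible persons who actually appear in the administrative file, $\mu_A$ the average among those who appear, and $\mu_N$ the average among those who do not.

```latex
\begin{align*}
\mu &= p\,\mu_A + (1 - p)\,\mu_N \\
\mu_A - \mu &= (1 - p)\,(\mu_A - \mu_N)
\end{align*}
% Hypothetical values: p = 0.6, \mu_A = 4, \mu_N = 6
% \mu = 0.6(4) + 0.4(6) = 4.8
% \mu_A - \mu = 0.4(4 - 6) = -0.8
```

In this hypothetical case, a report based only on the file would understate the true average by 0.8. The bias vanishes only if everyone eligible is in the file ($p = 1$) or if those absent resemble those present ($\mu_A = \mu_N$), and neither $p$ nor $\mu_N$ can be learned from the administrative file alone, which is why the separate studies described above are needed.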

Statistical and Research Competence. Because most administrative systems are committed to day-to-day record-keeping operations, they are seldom managed or staffed by persons with strong statistical and research competence. It is true that the statistical offices of a few large government agencies -- notably the Social Security Administration and the Internal Revenue Service -- have substantially influenced the statistical uses made of their principal data sources, which are mainly administrative records. Similar examples can be found at other levels of government and among private organizations, but there are also numerous instances in which such statistical and research competence is brought to bear only through informal or sporadic consulting arrangements, if at all.

Independent Scrutiny. Because administrative data systems are not created expressly for statistical reporting and research, they also tend to lack the strong ties to external groups of data users, and to the formal systems of professional peer review that characterize general purpose statistical-reporting and research operations. This isolation from independent expert scrutiny, coupled with the management orientation of administrative data systems, weakens the incentive to maintain high standards in the secondary statistical-reporting and research uses that are made of them.

Neglect of these three conditions is particularly dangerous in a governmental setting. In business, the quality of statistical reporting and research may be measured by the usefulness of such work to the planning and marketing functions that maintain a firm's competitive position. In government, however, feedback from the marketplace is attenuated. Save for the occasional newsworthy statistical report, the ancillary uses of administrative data systems may be ignored by outside professionals and invisible to the general public and its elected representatives.

In the Federal Government, formal arrangements for implementing the Federal Reports Act are supposed to serve as a check on the uses made of administrative record-keeping systems for statistical reporting and research. However, at other levels of government, the low visibility of such uses, coupled with the uneven impact of public information laws, can create an open invitation to misguided use of statistical reports and research findings based on administrative data.

We learned, for example, that one agency of a State government recently attempted to compare earnings declarations made by some public assistance beneficiaries to county welfare offices, with earnings of those same beneficiaries reported by their employers to a second State agency. This complex comparison of data derived from two quite different administrative record-keeping systems was undertaken mainly to verify the beneficiaries' eligibility for public assistance payments on a case-by-case basis, but it also resulted in a statistical report "showing" that a substantial percentage of the State's public assistance beneficiaries were engaged in "apparent fraud." The design of the comparison, and thus the resulting data, supported no such conclusion. Few people are aware of its technical failings, however, and it seems unlikely that many more will discover them, since appropriately documented data from the study have not been made available outside the sponsoring State agencies.

Recommendations

In light of our inquiry into the statistical-reporting and research uses of personal data in administrative record-keeping systems, we recommend that steps be taken to assure that all such uses are carried out in accordance with five principles.

First, when personal data are collected for administrative purposes, individuals should under no circumstances be coerced into providing additional personal data that are to be used exclusively for statistical reporting and research. When application forms or other means of collecting personal data for an administrative data system are designed, the mandatory or voluntary character of an individual's responses should be made clear.⁶

Second, personal data used for making determinations about an individual's character, qualifications, rights, benefits, or opportunities, and personal data collected and used for statistical reporting and research, should be processed and stored separately.⁷

Third, the amount of supplementary statistical-reporting and research data collected and stored in personally identifiable form should be kept to a minimum.

Fourth, proposals to use administrative records for statistical reporting and research should be subjected to careful scrutiny by persons of strong statistical and research competence.

Fifth, any published findings or reports that result from secondary statistical-reporting and research uses of administrative personal data systems should meet the highest standards of error measurement and documentation.

It would be difficult to apply each of these principles uniformly to all administrative automated personal data systems. For this reason, we have not translated them into safeguard requirements to be enacted as part of a code of fair information practice. Adherence to their spirit, however, is warranted by the growing significance of statistical-reporting and research uses of administrative personal data systems -- both for individual data subjects and for the institutions maintaining such systems.
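As one illustration of the spirit of the second and third principles, the following minimal sketch separates the items on a single application form into an administrative record and a supplementary statistical record. The field names and the `split_record` function are hypothetical, and the choice to omit every identifier from the statistical record is only one possible reading of "kept to a minimum," not a procedure recommended in this report.

```python
# Hypothetical field lists for a loan application form.
ADMIN_FIELDS = {"applicant_id", "name", "loan_amount"}   # needed for the decision
STAT_FIELDS = {"marital_status", "family_income"}        # voluntary, statistical only


def split_record(form: dict) -> tuple[dict, dict]:
    """Split one submitted form into two records to be stored separately.

    The administrative record keeps only decision-relevant fields.
    The statistical record carries no identifier and drops any
    voluntary item the applicant declined to answer (value None).
    """
    admin = {k: v for k, v in form.items() if k in ADMIN_FIELDS}
    stat = {
        k: v
        for k, v in form.items()
        if k in STAT_FIELDS and v is not None
    }
    return admin, stat


form = {
    "applicant_id": 1017,
    "name": "J. Doe",
    "loan_amount": 1500,
    "marital_status": None,   # applicant declined to answer
    "family_income": 9000,
}
admin_record, stat_record = split_record(form)
```

Because the statistical record holds no name or applicant number, a person making an entitlement decision from the administrative record never sees the supplementary items, and a declined item leaves no trace in either record.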

In addition, there are certain safeguards that can be feasibly applied to all administrative automated personal data systems used for statistical reporting and research. Specifically, we recommend that the following requirements be added to the safeguard requirements for administrative personal data systems:

Under I. General Requirements (Chapter IV, pp. 53-57), add

C. Any organization maintaining an administrative automated personal data system that publicly disseminates statistical reports or research findings based on personal data drawn from the system, or from administrative systems of other organizations, shall:

(1) Make such data publicly available for independent analysis, on reasonable terms; and

(2) Take reasonable precautions to assure that no data made available for independent analysis will be used in a way that might reasonably be expected to prejudice judgments about any individual data subject's character, qualifications, rights, opportunities, or benefits.

Under the Public Notice Requirement (Chapter IV, p. 58), add

(8a) The procedures whereby an individual, group, or organization can gain access to data used for statistical reporting or research in order to subject such data to independent analysis.

The purpose of general requirements C. (1) and (2) is to assure that when statistical reports or research findings based on personal data from administrative systems are used to affect social policy, the data will be available, in an appropriate form, for independent analysis. To comply with this requirement, an organization will have to plan carefully all publicly disseminated statistical-reporting and research uses of personal data in the administrative systems it maintains.

The public notice for an administrative personal data system will specify any statistical-reporting and research uses to be made of data in the system (requirement II. (7), p. 58). The additional information required by requirement (8a) will make it easier to obtain access to data for independent analysis.

¹ A representative of the State agency told the Committee that the agency would not compel a student applicant to provide this information "because we have come to find it is totally worthless . . . . [A]t one time we thought it would be a viable way of sampling the type of student we would assist. We determined it is not much use . . . [but w]e have not taken it out."

²For a cogent analysis of the effects of "contextual" information on clinical disability determinations, see Saad L. Nagi, Disability and Rehabilitation (Columbus, Ohio: Ohio State University Press), 1969, especially Chapters 2 and 9. Discussion of this problem will also be found in Stanton Wheeler (Ed.), On Record: Files and Dossiers in American Life (New York: Russell Sage Foundation), 1969.

³The special problems of data maintained exclusively for statistical reporting and research are discussed in Chapter VI.

⁴ As one representative of a small group of agencies observed in his testimony before the Committee: Client- (rather than management-) oriented agencies are philosophically committed to research only secondarily, as a tool for delivering more effective services. Therefore, they often must be dragged kicking and screaming into the data collection business. This is totally apart from their finances or their training . . . . Where services are . . . interfered with, data collection goes out the window. Measurement error can then be quite high.

⁵These variations may result from practices rooted in a bureaucratic subculture of which the record-keeping operation is but one -- albeit important -- part. See, for example, the discussions of how juvenile court, welfare, credit, and elementary school records are generated, in Wheeler, op. cit., Chapters 2, 5, 11, and 12.

⁶Recall in this regard safeguard requirement III (1), recommended in Chapter IV (p. 59, above) for all administrative automated personal data systems; viz., that an individual asked to supply data for a system be informed clearly whether he is legally required or free to refuse to provide the data requested. That safeguard, when applied, will effectively eliminate de facto coercion of data subjects into providing more information than is needed for making administrative decisions.

⁷Separating the two types of data in this way would make it easier to apply the protection against compulsory disclosure recommended in Chapter VI (pp. 102-103, below).
