Statistical Database Security
Name
Institutional Affiliation
Abstract
Statistical databases (SDBs) are primarily databases used for returning statistical information obtained from records to user queries for statistical data analysis. At times, the correlation of substantial statistics will lead to an inference of confidential information stored within the Statistical Database (SDB). Notably, the inference of confidential information from the statistical data raises security concerns since the information should not be disclosed to unauthorized parties. The threats to data security come from a snooper’s attempt to infer particular confidential information on a given entity. These security threats may lead to full or partial disclosure. Several options can be used for protection, as discussed in this research paper. These include query restriction, perturbation, data masking, and statistical control. Each of these approaches has its pros and cons, and it remains advisable that several approaches are implemented simultaneously for better protection.

Table of Contents
Abstract
Introduction
The Security Challenge To SDBs
Inference from the Statistical Database
Approaches for Statistical Database Security
Query Restriction
Perturbation
Data Masking
Statistical Control
Conclusion
References

Introduction
Some databases hold confidential or secret data about entities that must be protected from malicious parties. Statistical databases (SDBs) are primarily databases used for returning statistical information obtained from records to user queries for statistical data analysis (Ahlswede & Aydinian, 2006). The statistical information retrieved includes population averages, sums, counts, minimums, maximums, and standard deviations, among others. SDBs serve several critical objectives. They store rich data content to provide population statistics by age, income level, and education level. Statisticians working with the government, market research firms, and institutions that estimate economic indicators rely on SDBs (Ponniah, 2003). These professionals select records from the databases in order to apply statistical and mathematical functions to them.
These professionals need access to the SDBs. However, there is a considerable difference between the access privileges required by users of an operational database and those required by professionals who work with SDBs. The users of an operational database need information to run daily business operations, so they require access privileges to the individual records in the database (Ponniah, 2003). Conversely, professionals utilizing SDBs require privileges to access groups of records and to undertake mathematical and statistical calculations on the identified groups. They are not interested in single records but in samples that contain groups of records.
At times, the correlation of several statistics will lead to an inference of confidential information stored within the statistical database (SDB). The inference of confidential information from statistical data raises security concerns since the information should not be disclosed to unauthorized parties (Ahlswede & Aydinian, 2006). An SDB is considered secure if no protected data can be inferred from the available queries. If users can infer protected information in the SDB from the responses to queries, then the SDB is compromised and needs security mechanisms for protecting confidential information (Hasani, 2020). The threats to data security come from a snooper’s attempt to infer particular confidential information about a given entity. These security threats may lead to full or partial disclosure. Disclosure is considered to occur if, from the answers to one or more queries, a snooper can obtain the exact value or a nearly accurate estimate of a confidential attribute of an individual entity.
The Security Challenge To SDBs
Security attacks on SDBs typically begin as direct attacks, and their success depends on the level of protection and the mechanisms incorporated within the SDB. In a direct attack, the attacker expects to obtain the required result immediately. If the attack fails, the attacker moves to the next stage (Albalawi, 2018) and launches an indirect attack, in which the objective is to extract different types of information and numerous query combinations are issued with the intention of cheating the security mechanism. The tracker attack is used to breach databases that have incorporated suppression mechanisms, which withhold results that are dominated by a few records or that are drawn from very small query sets. The principle followed in this attack type is to avoid the direct query, which would be denied, and instead to issue a small number of related queries that are each answered. The responses to this set of queries are then studied together to extract the sensitive value. Within the literature, tracker attacks are also called the Linear System Vulnerability (Albalawi, 2018). Statistical queries are allowed on the database, but they remain a real security concern because they can be combined to reveal sensitive data about individuals. Hence, a database containing confidential information needs to ensure that the data does not get compromised. Notably, these attacks are carried out by authorized users of the system.
It should also be noted that aggregation functions such as sum, count, min, and max can be cleverly combined to access protected data. Extensive research has tried to develop an infrastructure that could prevent statistical inferences by separating data access from statistical analysis (T. General & R. Statistics, Eds., 2005). The problem with this approach is that such an infrastructure is costly to build and run. Statistical database security is therefore an inference challenge, since it concerns the derivation of sensitive information from non-sensitive information. These databases face three types of attacks: direct, indirect, and tracker attacks. The direct attack applies an aggregate function to a small sample, leading to a breach of confidential data and the compromise of entity information. The indirect attack combines a range of aggregates. The third attack type, the tracker, has been demonstrated to be more effective than the indirect attack.
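The difference between a direct attack and a tracker attack can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the formulation used by Albalawi (2018): the student records, the minimum query-set size K, and the helper restricted_sum are all hypothetical, and the database is assumed to deny any sum query whose query set has fewer than K or more than N - K records.

```python
# Hypothetical records: (name, sex, major, grade point).
# Grade points are multiples of 0.5 so the float sums below are exact.
STUDENTS = [
    ("Allen", "Female", "CS", 3.5),
    ("Baker", "Female", "EE", 2.5),
    ("Cook", "Male", "EE", 3.5),
    ("Davis", "Female", "CS", 3.0),
    ("Evans", "Male", "CS", 4.0),
    ("Frank", "Male", "EE", 3.0),
    ("Good", "Female", "Bio", 2.0),
    ("Hall", "Male", "Bio", 3.0),
]
K = 2  # assumed minimum query-set size


def restricted_sum(predicate):
    """Sum of grade points over the query set, or None if the query is denied."""
    matched = [s for s in STUDENTS if predicate(s)]
    if not (K <= len(matched) <= len(STUDENTS) - K):
        return None
    return sum(s[3] for s in matched)


def is_female(s):
    return s[1] == "Female"


def is_ee(s):
    return s[2] == "EE"


# Direct attack: the query set holds only Baker, so the query is denied.
direct = restricted_sum(lambda s: is_female(s) and is_ee(s))

# Tracker attack: two permitted queries whose difference isolates Baker.
all_females = restricted_sum(is_female)                               # 4 records
tracker = restricted_sum(lambda s: is_female(s) and not is_ee(s))     # 3 records
inferred_gp = all_females - tracker

print(direct, inferred_gp)
```

Neither permitted query touches a small query set, yet their difference is exactly the value the size rule was meant to protect.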
Inference from the Statistical Database
The statistical user of an underlying database of individual records is permitted to obtain only aggregate or statistical data and is forbidden from accessing the individual records themselves. The inference problem within this context is that the user could infer confidential information about the individual entities represented in the database, leading to a compromise. A positive compromise happens if the user deduces the value of an attribute associated with an individual entity, and a negative compromise happens when the user deduces that a specific value of an attribute is not associated with an individual entity. For instance, the statistic sum (EE • Female, GP) = 2.5 compromises the database if the user knows that Baker is the only female EE student.
In some scenarios, a sequence of queries could reveal information. For instance, suppose a questioner knows that Baker is a female EE student but does not know whether she is the only one. Consider the following sequence of two queries:
(count (EE • Female) = 1; sum (EE • Female, GP) = 2.5)
The sequence reveals confidential information: the first answer shows that Baker is the only female EE student, so the second answer is her grade point. Some knowledge relating to one individual in a database can thus be combined with queries to reveal protected information. In a large database, there may be limited or no opportunities for singling out a particular record with a distinct set of characteristics, such as being the only female student in a department. However, another attack angle is available to a user who knows of incremental changes to the database. Consider, for instance, a personnel database in which the total of employees’ salaries can be queried, and suppose the questioner is aware of the following information:
➔ Salary range for a new systems analyst with a BS degree is $45K to $55K
➔ Salary range for a new systems analyst with an MS degree is $55K to $65K
If two new systems analysts are added to the payroll and the salary total changes by $130K, then the questioner will know that both new employees have an MS degree.
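The deduction in the salary example can be checked by enumerating the possible degree combinations. The following Python sketch is illustrative only; the function name feasible_degree_pairs is hypothetical, and salaries are expressed in thousands of dollars.

```python
from itertools import combinations_with_replacement

# Salary ranges (in $K) known to the questioner, taken from the example above.
SALARY_RANGES = {"BS": (45, 55), "MS": (55, 65)}


def feasible_degree_pairs(total_increase):
    """Return the degree combinations for two new hires whose combined
    salary range can contain the observed increase in the payroll total."""
    feasible = []
    for pair in combinations_with_replacement(sorted(SALARY_RANGES), 2):
        low = sum(SALARY_RANGES[degree][0] for degree in pair)
        high = sum(SALARY_RANGES[degree][1] for degree in pair)
        if low <= total_increase <= high:
            feasible.append(pair)
    return feasible


print(feasible_degree_pairs(130))  # only the MS/MS combination remains
```

Two BS hires can add at most $110K and one hire of each kind at most $120K, so an increase of $130K leaves a single possibility.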
Generally, the inference challenge in an SDB can be described as follows. A characteristic function C defines a subset of records or rows within the database, and a query that uses C provides statistics on that subset. If the subset is small enough, perhaps even a single record, the questioner will be able to infer characteristics of a single person or a small group. Even for larger subsets, the nature or structure of the data could be such that unauthorized information is released.
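A minimal Python sketch of this description follows, assuming a small hypothetical table of student records in which only Baker's row mirrors the earlier example. The helper names count and total stand in for the statistical queries; they are not taken from any particular system.

```python
# Hypothetical student records; only Baker's values come from the example in the text.
RECORDS = [
    {"name": "Allen", "sex": "Female", "major": "CS", "gp": 3.5},
    {"name": "Baker", "sex": "Female", "major": "EE", "gp": 2.5},
    {"name": "Cook", "sex": "Male", "major": "EE", "gp": 3.5},
    {"name": "Davis", "sex": "Male", "major": "CS", "gp": 3.0},
    {"name": "Evans", "sex": "Male", "major": "EE", "gp": 3.0},
]


def count(characteristic):
    """Number of records in the subset defined by the characteristic function."""
    return sum(1 for record in RECORDS if characteristic(record))


def total(characteristic, attribute):
    """Sum of one attribute over the subset defined by the characteristic function."""
    return sum(record[attribute] for record in RECORDS if characteristic(record))


def ee_female(record):
    """Characteristic function C = EE AND Female."""
    return record["major"] == "EE" and record["sex"] == "Female"


# When the subset holds a single record, the aggregate is that record's value.
baker_gp = total(ee_female, "gp") if count(ee_female) == 1 else None
print(count(ee_female), baker_gp)
```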
Approaches for Statistical Database Security
Query Restriction
This approach to statistical database security rejects any query that could cause a compromise (Stallings, 2007). The answers that are provided are accurate. These techniques defend against inference by restricting statistical queries so that they do not reveal confidential information about individuals. In this context, restriction means that some queries are denied.
Query size restriction is the simplest form of query restriction. For a database of size N (the number of rows or records), a query q(C) is allowed solely if the number of records matching C satisfies k ≤ |X(C)| ≤ N − k, where X(C) is the query set of C and k is a fixed integer greater than 1. Therefore, the user cannot access any query set of fewer than k records. The upper bound is also required. Let All denote the set of all the records within the database. If q(C) were allowed whenever |X(C)| ≥ k, with no upper bound, then for a query set smaller than k the user could query the complement ~C, whose query set is large, and compute q(C) = q(All) − q(~C). The upper bound of N − k guarantees that the user does not obtain access to statistics on query sets of fewer than k records in this way. In practice, queries of the form q(All) are permitted, allowing users to easily access statistics that are calculated on the whole database (Stallings, 2007). Query size restriction counters attacks that depend on very small query sets. For instance, suppose a user knows that a particular individual I satisfies a particular characteristic formula C, such as Allen being a female CS major. If the query count (C) returns 1, then C uniquely identifies I. The user could then test whether I has a specific characteristic D with the query count (C • D), and could learn the value of a numerical attribute A for I with the query sum (C, A).
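The size rule k ≤ |X(C)| ≤ N − k, together with the practical exception for q(All), can be sketched as follows. The class SizeRestrictedSDB, the exception QueryDenied, the helper try_total, and the sample records are all hypothetical names and data introduced for illustration; a production system would need considerably more than this.

```python
class QueryDenied(Exception):
    """Raised when a query set violates the size restriction."""


class SizeRestrictedSDB:
    """Answers sum queries only when k <= |X(C)| <= N - k, or when C selects All."""

    def __init__(self, records, k):
        if k <= 1:
            raise ValueError("k must be a fixed integer greater than 1")
        self.records = list(records)
        self.k = k

    def total(self, predicate, attribute):
        query_set = [r for r in self.records if predicate(r)]
        n, size = len(self.records), len(query_set)
        # q(All) is permitted; every other query must respect both bounds.
        if size != n and not (self.k <= size <= n - self.k):
            raise QueryDenied(f"query set of size {size} is not permitted")
        return sum(r[attribute] for r in query_set)


def try_total(sdb, predicate, attribute):
    """Return the statistic, or None when the query is denied."""
    try:
        return sdb.total(predicate, attribute)
    except QueryDenied:
        return None


# Hypothetical data in which Allen is the only female CS major.
records = [
    {"name": "Allen", "sex": "Female", "major": "CS", "gp": 3.5},
    {"name": "Baker", "sex": "Female", "major": "EE", "gp": 2.5},
    {"name": "Cook", "sex": "Male", "major": "EE", "gp": 3.5},
    {"name": "Davis", "sex": "Male", "major": "CS", "gp": 3.0},
    {"name": "Evans", "sex": "Male", "major": "EE", "gp": 3.0},
    {"name": "Frank", "sex": "Male", "major": "CS", "gp": 4.0},
]
sdb = SizeRestrictedSDB(records, k=2)


def female_cs(r):
    return r["sex"] == "Female" and r["major"] == "CS"


print(try_total(sdb, female_cs, "gp"))                    # None: |X(C)| = 1 < k
print(try_total(sdb, lambda r: not female_cs(r), "gp"))   # None: |X(~C)| = 5 > N - k
print(try_total(sdb, lambda r: True, "gp"))               # q(All) is answered
```

The second denial is the one supplied by the upper bound: without it, subtracting the complement from q(All) would recover Allen's grade point.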
Perturbation
Query restriction techniques can be expensive, and it is difficult to implement them in a way that fully thwarts inference attacks, especially when the user has supplementary knowledge. A simpler and highly effective approach for larger databases is to add noise to the statistics generated from the original data. The data within the SDB can be modified, or perturbed, so that the resulting statistics are not useful for inferring values for individual records. This is considered data perturbation (Stallings, 2007). Alternatively, when a statistical query is made, the system can generate statistics that are modified from those the original database would provide, again thwarting attempts to gain knowledge of individual records. This is considered output perturbation. Regardless of the particular perturbation technique used, the designer needs to produce statistics that accurately reflect the underlying database. Because of the perturbation, differences will be present between the perturbed results and the results the unmodified database would give. The objective is to minimize these differences and provide consistent results.
The first data perturbation technique is data swapping, which entails transforming the database by substituting values that conform to a similar assumed underlying probability distribution (Stallings, 2007). The second method entails generating statistics from the assumed underlying probability distribution. Output perturbation techniques are suitable for large databases, and one of them is similar to the approach employed by the United States Census Bureau. In this technique, the user first issues a query q(C) that is to return a statistical value; the query set so defined is X(C). The system then replaces X(C) with a sampled query set, an appropriately selected subset of X(C). Finally, the system calculates the requested statistic on the sampled query set and returns the value. Other output perturbation methods calculate the statistic on the requested query set and then adjust the answer by a given amount in a systematic or randomized fashion. Each of these techniques is focused on thwarting tracker attacks and other attacks that could be made against query restriction techniques.
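As a rough illustration of output perturbation, the sketch below adds bounded, zero-mean noise to a sum before it is returned. The choice of uniform noise, the bound of 500, the function name perturbed_sum, and the salary figures are assumptions made for the example and are not prescribed by Stallings (2007).

```python
import random


def perturbed_sum(values, max_noise, rng):
    """Return the true sum plus zero-mean uniform noise in [-max_noise, max_noise]."""
    return sum(values) + rng.uniform(-max_noise, max_noise)


rng = random.Random(7)  # fixed seed only so that the sketch is repeatable
salaries = [52_000, 61_000, 58_500, 47_250, 64_000]  # hypothetical values

true_total = sum(salaries)
noisy_total = perturbed_sum(salaries, max_noise=500, rng=rng)
print(true_total, round(noisy_total, 2))
```

The noise bound expresses the trade-off described above: a small bound keeps the reported statistic close to the true one, while a larger bound makes differences between answers less useful for isolating one record.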
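The sampled query set procedure can be sketched in Python as shown below. This is a simplified illustration: selecting each record independently with a fixed probability is an assumption of the sketch, as are the function name sampled_query_mean and the generated records, and it should not be read as the Census Bureau's actual procedure.

```python
import random


def sampled_query_mean(records, predicate, attribute, keep_probability, rng):
    """Compute a mean over a random subset of the query set X(C)."""
    query_set = [r for r in records if predicate(r)]
    # Replace X(C) with a sampled query set (assumed: independent selection).
    sampled = [r for r in query_set if rng.random() < keep_probability]
    if not sampled:
        return None  # empty sample; a real system would need a policy for this
    return sum(r[attribute] for r in sampled) / len(sampled)


rng = random.Random(11)  # fixed seed only so that the sketch is repeatable
# Hypothetical records: 200 students split between two departments.
records = [
    {"dept": "EE" if i % 2 else "CS", "gp": 2.0 + (i % 5) * 0.5}
    for i in range(200)
]

estimate = sampled_query_mean(records, lambda r: r["dept"] == "EE", "gp", 0.8, rng)
print(estimate)
```

Because the user never learns which records were dropped, differences between answers no longer correspond to exact differences between query sets.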
Data Masking
This approach protects sensitive data by replacing or masking the original data with random values or characters. The primary objective is protecting data that is considered personally identifiable information (PII), sensitive data, or commercially sensitive data (Albalawi, 2018). The masked data should look real and consistent and needs to remain useful for valid test cycles. Shuffling and masking out are among the masking techniques that can be applied. The shuffling technique masks the existing data by moving values between rows, so that no value remains in its original row. The masking-out technique sanitizes the data by substituting particular characters with mask characters.
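Both techniques can be sketched briefly in Python. The function names shuffle_column and mask_out, the sample rows, and the use of a cyclic rotation to guarantee that every value leaves its original row are choices made for this illustration and are not taken from Albalawi (2018).

```python
import random


def shuffle_column(rows, column, rng):
    """Move the values of one column between rows with a random cyclic rotation,
    so that every value ends up in a different row from the one it started in."""
    if len(rows) < 2:
        return [dict(row) for row in rows]
    offset = rng.randrange(1, len(rows))  # never 0, so no value stays in place
    values = [row[column] for row in rows]
    rotated = values[offset:] + values[:offset]
    return [{**row, column: value} for row, value in zip(rows, rotated)]


def mask_out(value, visible=4, mask_char="X"):
    """Replace all but the last `visible` characters with a mask character."""
    cut = max(len(value) - visible, 0)
    return mask_char * cut + value[cut:]


rng = random.Random(3)  # fixed seed only so that the sketch is repeatable
rows = [  # hypothetical records
    {"name": "Allen", "salary": 52000},
    {"name": "Baker", "salary": 61000},
    {"name": "Cook", "salary": 58500},
    {"name": "Davis", "salary": 47250},
]

masked_rows = shuffle_column(rows, "salary", rng)
print(masked_rows)
print(mask_out("4111111111111111"))  # XXXXXXXXXXXX1111
```

Shuffling preserves column-level statistics such as the salary total, which is why the masked data stays useful for testing, while masking out keeps only enough of a value to recognise its format.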
Statistical Control
Statistical control queries apply statistical aggregate functions to the database tables. Users are generally allowed to extract statistical information about the population but not to retrieve information about an individual. The Database Management System (DBMS) needs to ensure that information relating to individuals remains private and confidential while still providing useful statistical data extractions to the users (Franklin & Asagba, 2021). Statistical control on databases therefore emphasizes protecting individuals’ privacy. The statistical security mechanisms disallow the retrieval of data on individuals by preventing queries that retrieve individual attribute values and allowing only requests that involve statistical aggregate functions.
Another functional means for securing information within the database is encryption and key infrastructures. This approach secures data within an insecure environment by using an encryption algorithm to scramble the data with a predetermined key, which is also needed to recover the plaintext from the cipher text through decryption (Franklin & Asagba, 2021). To this effect, the National Institute of Standards and Technology (NIST) replaced the commonly used Data Encryption Standard (DES), which has a 64-bit block size, with the Advanced Encryption Standard (AES), which has a 128-bit block size (Franklin & Asagba, 2021). DES and AES use a single secret key and are known as symmetric key algorithms. Public key encryption could also be incorporated, using a pair of keys, one for encryption and the other for decryption.
The digital signature is another mechanism that uses encryption techniques to provide authentication services in electronic commerce applications (Franklin & Asagba, 2021). A digital signature is a means of associating a mark unique to an individual with a body of text. This mark needs to be unforgeable so that others can authenticate and verify the originator of the signature. A digital certificate combines the value of a public key with the identity of the person or service that holds the corresponding private key into a digitally signed statement (Franklin & Asagba, 2021). Certificates are issued and signed by a certification authority (CA), and the entity that receives the certificate from the CA becomes the certificate’s subject.
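One way to picture this kind of control is a thin gatekeeper that accepts only aggregate requests over approved columns. The sketch below uses Python's standard sqlite3 module; the table name employees, the column whitelist, and the function statistical_query are hypothetical, and the sketch deliberately omits the query-set size checks that a real mechanism would also need.

```python
import sqlite3

ALLOWED_AGGREGATES = {"COUNT", "SUM", "AVG", "MIN", "MAX"}
ALLOWED_COLUMNS = {"salary", "age", "department"}  # assumed non-identifying columns


def statistical_query(conn, aggregate, column, group_column=None, group_value=None):
    """Run one aggregate over an approved column; refuse anything else."""
    aggregate = aggregate.upper()
    if aggregate not in ALLOWED_AGGREGATES:
        raise PermissionError(f"{aggregate} is not a permitted aggregate function")
    if column not in ALLOWED_COLUMNS:
        raise PermissionError(f"column {column!r} may not be queried")
    # Identifiers are checked against the whitelists above; the value is bound.
    sql = f"SELECT {aggregate}({column}) FROM employees"
    params = ()
    if group_column is not None:
        if group_column not in ALLOWED_COLUMNS:
            raise PermissionError(f"column {group_column!r} may not be used as a filter")
        sql += f" WHERE {group_column} = ?"
        params = (group_value,)
    return conn.execute(sql, params).fetchone()[0]


# Hypothetical in-memory database for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, age INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Allen", "IT", 34, 50000),
        ("Baker", "IT", 41, 60000),
        ("Cook", "HR", 29, 70000),
        ("Davis", "HR", 52, 60000),
    ],
)

print(statistical_query(conn, "avg", "salary"))
print(statistical_query(conn, "count", "salary", "department", "IT"))
```

Filtering by name is rejected because name is not in the whitelist, but an aggregate over a very small group can still expose an individual, which is why such a gatekeeper is best combined with the restriction and perturbation techniques discussed earlier.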
Conclusion
Statistical databases are vital assets that need to be protected from deliberate and unintentional threats through effective security systems. Failure to do so means that malicious parties can access confidential information tied to identifiable individuals. Several options can be used for protection, as discussed. These include query restriction, perturbation, data masking, and statistical control. Each of these approaches has its pros and cons, and it remains advisable that several approaches be implemented simultaneously for better protection.

References
Ahlswede, R., & Aydinian, H. (2006, July). On security of statistical databases. In 2006 IEEE International Symposium on Information Theory (pp. 506-508). IEEE.
Albalawi, U. (2018, December). Countermeasure of statistical inference in database security. In 2018 IEEE International Conference on Big Data (Big Data) (pp. 2044-2047). IEEE.
Franklin, M. & Asagba, P.A. (2021). Information Asset Security: Databases and Access Control. International Journal of Computer Science and Mathematical Theory E-ISSN 2545-5699 P-ISSN 2695-1924, 7(1).
Hasani. (2020, August 24). Statistical database security. Retrieved from https://www.geeksforgeeks.org/statistical-database-security/
Ponniah, P. (2003). Chapter 16: Database Security. Database design and development: an essential guide for IT professionals. John Wiley & Sons, Inc.
Stallings, W. (2007). Computer security and statistical databases. Retrieved from https://www.informit.com/articles/article.aspx?p=782117
T. General & R. Statistics, Eds. (2005). Monographs of Official Statistics, ser. 9279011081. Work session on statistical data confidentiality.
