Synthesizing an anonymized multidimensional dataset featuring financial, economic, demographic, and personal traits data
Abstract
This paper presents a novel approach to generating synthetic data arrays, addressing the scarcity of datasets that contain sensitive information, a scarcity caused by restrictions under legislation such as the GDPR and the Bank Secrecy Act. By integrating statistical methods, including Monte Carlo simulation and Cholesky decomposition, with business logic, the study outlines a comprehensive methodology for creating multidimensional synthetic datasets. These datasets incorporate demographic, personality, financial, and banking variables to simulate the profiles of financially active individuals. As an alternative to traditional data collection, the approach offers a way to work around restricted access to sensitive data while remaining compliant with legal frameworks. Synthetic data preserves the interrelationships between variables and provides a secure testing environment, although generating high-quality synthetic databases is inherently complex. The synthesized data are validated with the Kolmogorov-Smirnov test to check their accuracy and relevance. The approach facilitates the advancement of data-driven models in fields where access to sensitive data is limited, and it promotes the ethical use of data by adhering to privacy regulations. The paper demonstrates the potential of synthetic data as a viable resource for scientific research, offering a detailed exploration of its generation process and the implications for future applications in sensitive areas of study.
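To make the statistical core named in the abstract more concrete, the following is a minimal sketch of how Monte Carlo draws can be given a target correlation through a Cholesky factor and then checked with a Kolmogorov-Smirnov statistic. It is not the authors' algorithm: the two-variable correlation matrix, the standard normal marginals, the sample size, the seed, and all function names are assumptions chosen purely for illustration, and the business-logic layer described in the paper is omitted.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L such that L * L^T equals a symmetric
    positive-definite matrix (given as a list of lists)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def correlated_normals(corr, n_samples, rng):
    """Monte Carlo step: draw independent N(0, 1) values and mix them
    with the Cholesky factor so the columns follow `corr`."""
    lower = cholesky(corr)
    dim = len(corr)
    rows = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rows.append([sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(dim)])
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDF of `sample` and the reference `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap


# Hypothetical target: two standardized traits with correlation 0.6.
TARGET_CORR = [[1.0, 0.6], [0.6, 1.0]]
rng = random.Random(42)  # fixed seed so the sketch is repeatable
draws = correlated_normals(TARGET_CORR, 5000, rng)

col0 = [row[0] for row in draws]
col1 = [row[1] for row in draws]
sample_corr = pearson(col0, col1)
ks_d = ks_statistic(col1, normal_cdf)
# Large-sample approximation of the 5% critical value for the KS statistic.
ks_critical = 1.36 / math.sqrt(len(col1))

print(f"sample correlation: {sample_corr:.3f}")
print(f"KS statistic: {ks_d:.4f} (approx. 5% critical value {ks_critical:.4f})")
```

In this sketch the Cholesky step only imposes the correlation structure; a real pipeline would additionally transform each column to its intended marginal distribution and compare it against the corresponding reference distribution rather than the standard normal.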
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.