Scalable Data Archival & Purging Mechanism for High-Volume Applications

International Journal of Sustainability and Innovation in Engineering (IJSIE)
2024

https://www.doi.org/10.56830/IJSIE202412

Author

Zahir Sayyed

Abstract

Data volumes in modern applications are growing exponentially, driven by IoT, multimedia, financial, and healthcare workloads, and storage scalability, cost, and privacy compliance have become major obstacles. This paper proposes an automated, policy-driven, scalable framework for data archival and purging in high-volume applications. It resolves lifecycle operations in real time using tiered storage, metadata-driven decision-making, and distributed coordination, and is designed to scale to petabytes. The architecture decomposes the pipeline into ingestion, metadata indexing, policy evaluation, and execution, which isolates faults and allows work to be scaled horizontally. It implements a hybrid indexing strategy in which write-heavy metadata ingestion and search data are stored in wide-column stores while complex policy queries are run on search engines, and it introduces configurable retention policies based on data age, access frequency, data size, and business-specific tags. Candidate selection uses three algorithms: a rule-based heuristic, a classifier derived through supervised learning on past access patterns, and a hybrid of the two. Experiments on synthetic and real-world e-commerce workloads show decision throughput of up to 20,000 requests per second and policy latency well below one second, with storage costs reduced by an order of magnitude or more, while compliance is maintained through safe-delete pipelines with audit trails and rollback capabilities. The analysis characterizes the trade-offs between latency and cost savings across the strategies and demonstrates the elasticity of the framework under petabyte-scale loads. The results confirm the potential of automated, adaptive lifecycle management across cloud-native and on-premises infrastructures, offering a credible answer to the data management requirements of modern applications along with improved operational resilience.
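As a rough illustration of the rule-based heuristic mentioned above, the following Python sketch evaluates a retention policy over the four signals the abstract names: data age, access frequency, data size, and business-specific tags. It is not the paper's implementation. The names `ObjectMeta`, `RetentionPolicy`, and `decide`, the `legal-hold` tag, the 30-day access window, and the rule ordering are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    ARCHIVE = "archive"
    PURGE = "purge"


@dataclass(frozen=True)
class ObjectMeta:
    """Hypothetical metadata record for one stored object."""
    key: str
    created_at: datetime
    accesses_last_30d: int          # access frequency signal (assumed window)
    size_bytes: int
    tags: frozenset = frozenset()   # business-specific tags


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical configurable policy; all thresholds are illustrative."""
    archive_after: timedelta
    purge_after: timedelta
    max_accesses_for_archive: int
    min_size_for_archive: int
    hold_tags: frozenset = frozenset({"legal-hold"})


def decide(meta: ObjectMeta, policy: RetentionPolicy, now: datetime) -> Action:
    """Return the lifecycle action for one object under a rule-based policy."""
    # Tagged holds override every other rule (assumed precedence).
    if meta.tags & policy.hold_tags:
        return Action.KEEP
    age = now - meta.created_at
    # Objects past the purge horizon are purge candidates.
    if age >= policy.purge_after:
        return Action.PURGE
    # Old, rarely accessed, sufficiently large objects move to a colder tier.
    cold = meta.accesses_last_30d <= policy.max_accesses_for_archive
    large = meta.size_bytes >= policy.min_size_for_archive
    if age >= policy.archive_after and cold and large:
        return Action.ARCHIVE
    return Action.KEEP


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    policy = RetentionPolicy(
        archive_after=timedelta(days=90),
        purge_after=timedelta(days=365),
        max_accesses_for_archive=2,
        min_size_for_archive=1024,
    )
    samples = [
        ObjectMeta("recent", datetime(2024, 5, 1), 0, 4096),
        ObjectMeta("cold", datetime(2024, 1, 1), 1, 4096),
        ObjectMeta("expired", datetime(2023, 1, 1), 0, 4096),
        ObjectMeta("held", datetime(2023, 1, 1), 0, 4096, frozenset({"legal-hold"})),
    ]
    for meta in samples:
        print(meta.key, decide(meta, policy, now).value)
```

In a deployment along the lines the abstract describes, a function like `decide` would only nominate candidates. The actual deletion would still pass through the safe-delete pipeline with its audit trail and rollback support.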

Keywords

Scalable Data Archival, Policy-Driven Purging, Tiered Storage Management, Metadata-Driven Decisioning, Hybrid Machine-Learning Heuristics
