
Capacity and Compliance in Hybrid Cloud, Multi-Tenant Big Data Platforms


We cover capacity governance, technical risk management, security compliance, and site reliability engineering in big data platforms.
As organizations realize how data-driven insights can empower their strategic decisions and increase their ROI, the focus is on building data lakes and data warehouses where all the big data can be safely archived. That data can then power data engineering, data science, business analytics, and operational analytics initiatives that benefit the business by improving operational efficiency, reducing operating costs, and supporting better strategic decisions. However, the exponential growth in the data we consume and generate every day makes a well-structured approach to capacity governance in the big data platform a necessity.

Capacity governance and scalability engineering are inter-related disciplines: developing an appropriate scalability strategy for the platform requires a comprehensive understanding of compute and storage capacity demands, infrastructure supply, and the dynamics between them. Technical risk resolution and security compliance are equally important aspects of capacity governance. Like any other software application, big data applications operate on customers' personally identifiable information (PII) and influence critical business functions and strategic decisions, so they must follow the latest data security and service reliability standards at all times. All IT hardware and software components must be regularly updated or patched with the latest security patches and bug fixes, and regardless of which infrastructure it is deployed on, each component of the big data platform must be continuously scanned so that any potential technical or security risk is quickly identified and resolved.

The diagram above shows the functional architecture of the capacity governance framework for a big data platform. The rightmost section represents the infrastructure or supply stream, which is closer to the technical users. The leftmost section is the demand or business-as-usual stream, which is closer to business users who are more interested in solving business use cases than in the technical details of the solution. The center section represents the big data framework, where we manage the supply against the demands using an appropriate set of tools and technologies. To keep the article brief, the technical architecture is out of scope; a set of microservices and jobs was used to curate insights from a diverse set of tools, each appropriate for a specific problem domain.

At the start we asked the obvious question: is it worth the effort to build a solution around this problem, or can we identify one or a few tools in the market to solve the whole capacity and compliance problem? Over time, working with a range of monitoring, APM, and analytics tools, we understood that we would need to weave together the information coming out of a diverse set of tools, each efficiently solving its own domain-specific problems, to build the bigger picture at the platform level. While we should not try to re-invent what existing tools already offer, we should also not try to answer every functional problem with one tool or a single approach.
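To make the idea of weaving tool-specific data into a platform-level picture more concrete, here is a minimal sketch in Python. The adapter functions (fetch_yarn_apps, fetch_k8s_pods) and field names are hypothetical stand-ins for whatever monitoring, APM, or cloud APIs are actually in place; the point is only that each source is normalized into one common record shape that downstream capacity analytics can consume.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class CapacityRecord:
    """Common shape for utilization data, regardless of the source tool."""
    platform: str          # e.g. "yarn", "kubernetes", "aws"
    tenant: str            # business unit, queue, or namespace
    component: str         # middleware or application component
    cpu_used_cores: float
    memory_used_gb: float

def fetch_yarn_apps() -> Iterable[CapacityRecord]:
    # Stand-in for a call to the YARN ResourceManager REST API.
    return [CapacityRecord("yarn", "risk-analytics", "spark-etl", 120.0, 480.0)]

def fetch_k8s_pods() -> Iterable[CapacityRecord]:
    # Stand-in for a call to the Kubernetes metrics API.
    return [CapacityRecord("kubernetes", "model-serving", "scoring-api", 8.5, 32.0)]

def collect_platform_view(
    adapters: List[Callable[[], Iterable[CapacityRecord]]]
) -> List[CapacityRecord]:
    """Merge records from every tool-specific adapter into one dataset."""
    records: List[CapacityRecord] = []
    for adapter in adapters:
        records.extend(adapter())
    return records

if __name__ == "__main__":
    for rec in collect_platform_view([fetch_yarn_apps, fetch_k8s_pods]):
        print(rec)
```

Each adapter hides the quirks of one tool behind the same interface, which is what lets the governance framework stay tool-agnostic while still reusing what each tool does well.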
For example, the tools suitable for monitoring, log analysis, and application performance management for Spark applications running on YARN are completely different from the tools used for the same purposes for microservices deployed as containers on Kubernetes. Likewise, it would be inappropriate to retrofit a tool built for monitoring, APM, and autoscaling of on-premises VMware virtual machines onto virtual machines provisioned in the public cloud; each public cloud offers its own native monitoring tools and autoscaling APIs for that purpose. A major portion of a data science or AI industrialization project is not necessarily MapReduce/Spark/big data; a considerable part of the project interfaces with external systems or is deployed as conventional software or microservices, and simply requires a diligent selection of DevOps tools and underlying deployment platforms.

For financial organizations, it is extremely important to ensure continuous compliance and high availability of the big data platform. This requires a comprehensive understanding of the available compute and storage infrastructure (supply), the 4 Vs (volume, velocity, variety, veracity) and nature of the workload (processing), and the SLA and data-sensitivity requirements (demand/BAU) in order to build a suitable, scalable solution for a particular business use case while ensuring high availability and capacity redundancy in the platform. Our approach was to follow a maturity matrix: start with data collection and exploration, move on to identifying trends and performing analytics, and then leverage those analytics to empower intelligent operations. We targeted to achieve this in seven phases.

1. Comprehensive Inventory

The first phase of the project was to build a comprehensive cross-platform inventory. Infrastructure-level inventory for virtual machines was readily available from on-premises monitoring tools; however, the APIs and monitoring patterns in the containerized platform and the different public clouds were different. Moreover, VM-level inventory alone was not sufficient: the platform used multiple in-house, vendor-supported, and open-source middleware products, each consisting of distinct components performing specific functions. We had to implement publishers, observers, and subscribers to perform role tagging for middleware components deployed to virtual machine, containerized (Kubernetes/OpenShift), or public cloud infrastructure platforms.

2. Infrastructure Utilization, Consumption Trends, and Analytics

Once we had a clear understanding of the roles deployed across the hybrid cloud infrastructure, we needed to overlay the infrastructure utilization trends and classify the data into more fine-grained, component-level metrics. The same monitoring data also had to solve both problems: describing the available infrastructure supply and, once attributed to individual tenants, quantifying the demand against it.

3. Demand Analysis and Forecasting

Once we managed to obtain tenant-wise occupancy data from infrastructure utilization, we were in a position to identify growth trends at the storage, compute, and project-size levels. Based on the relative growth of individual datasets and tenants, we had to identify the top occupants, the fastest-growing datasets and tenants, the top (compute) wasters, and the non-critical noisy neighbors.

4. Performance Optimization

The next natural step was to identify the top wasters and work together with business unit partners to optimize them, reduce wastage, ensure redundancy, and increase service reliability, as sketched below.
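As an illustration of this step, the following sketch ranks job runs by wasted capacity, i.e. the gap between allocated and actually used CPU and memory. The job records and field names (job_id, allocated and used vCores and memory) are hypothetical stand-ins for whatever the resource manager or APM tool actually reports.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class JobUsage:
    job_id: str
    tenant: str
    allocated_vcores: float
    used_vcores: float
    allocated_mem_gb: float
    used_mem_gb: float

def wasted_capacity(job: JobUsage) -> float:
    """Simple waste score: unused vCores plus unused memory (GB), weighted
    equally. A real scoring function would weight by cost and job duration."""
    cpu_waste = max(job.allocated_vcores - job.used_vcores, 0.0)
    mem_waste = max(job.allocated_mem_gb - job.used_mem_gb, 0.0)
    return cpu_waste + mem_waste

def top_wasters(jobs: List[JobUsage], n: int = 10) -> List[JobUsage]:
    """Return the n jobs with the largest allocated-vs-used gap."""
    return sorted(jobs, key=wasted_capacity, reverse=True)[:n]

# Example with two hypothetical job runs; the first over-allocates heavily.
jobs = [
    JobUsage("etl_daily_0423", "risk-analytics", 200, 40, 800, 120),
    JobUsage("scoring_hourly_17", "fraud-ml", 32, 28, 128, 110),
]
for job in top_wasters(jobs, n=2):
    print(job.job_id, round(wasted_capacity(job), 1))
```

The resulting ranking is what gets taken back to the business unit partners: the jobs at the top of the list are the candidates for tuning, right-sizing, or rescheduling.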
At any point in time, there are thousands of job runs scheduled onto the cluster by the job scheduler, and each queue contains hundreds of jobs serving daily, weekly, monthly, or ad-hoc schedules. We therefore decided to identify and tune the biggest CPU and RAM wasters and to reschedule non-critical jobs to non-critical or off-peak time slots.

5. High Availability, Capacity Redundancy, and Scalability

Identifying the business-critical workloads, ensuring sufficient capacity redundancy for them, and re-aligning non-critical noisy neighbors to off-peak hours is one of the key performance indicators for the site reliability engineering team. Beyond that, ensuring high availability at each middleware component level, identifying site reliability issues, and making sure each platform component is individually and collectively scalable enough to handle peak-hour surges can only be achieved by monitoring and observing the operational performance data over time. For each middleware component, high availability (preferably active-active) at both the worker and the master level is an essential requirement, particularly for business-critical category 1 applications. For in-house applications, thorough and consistent code analysis, unit/integration/performance testing, dependency license checks, and vulnerability scanning data need to be gathered at each component level. For managed services in the public cloud, service level agreements need to be established with the respective cloud infrastructure providers, and the same software support, patching, and incident management SLAs need to be secured from external software providers.

6. Autoscaling, Self-Healing, and Cost Optimization

In the final phase of maturity, the capacity governance framework becomes more intrusive: the decision engines not only produce schedules and weekly or monthly rosters for SRE teams but can also auto-heal or clean up some of the services without human intervention, depending on the criticality of the environment, the service level agreement, and the impact radius of the change. All of the data and analytics captured above must ultimately be curated to identify the most performant and cost-effective infrastructure for each middleware component. The same data is also used to create autoscaling policies, self-healing automation (specifically for non-HA components), performance and cost optimization reports, and recommendation models.

7. Technical Risk and Continuous Compliance

Procuring and setting up a performant, security-compliant system once is not enough; it is necessary to ensure that every component remains compliant and free from ongoing threats and vulnerabilities, regardless of the infrastructure platform and topology it is deployed on.
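To make the continuous-compliance idea concrete, here is a minimal sketch, again with hypothetical inputs: a list of inventoried components (reusing the cross-platform inventory from phase 1) and a mapping of component versions to open findings that would, in practice, come from a vulnerability scanner or CVE/advisory feed. It simply flags every component running a version with open findings so that patching can be scheduled, whatever platform it sits on.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Component:
    name: str          # middleware or application component
    version: str
    platform: str      # "vm", "kubernetes", "aws", ...
    tenant: str

# Hypothetical vulnerability feed: (name, version) -> open findings.
# In practice this would be populated from a scanner or advisory feed.
OPEN_FINDINGS: Dict[Tuple[str, str], List[str]] = {
    ("kafka", "2.8.0"): ["CVE-XXXX-YYYY"],
}

def non_compliant(components: List[Component]) -> List[Tuple[Component, List[str]]]:
    """Return (component, findings) pairs for anything with open findings."""
    report = []
    for comp in components:
        findings = OPEN_FINDINGS.get((comp.name, comp.version), [])
        if findings:
            report.append((comp, findings))
    return report

inventory = [
    Component("kafka", "2.8.0", "vm", "risk-analytics"),
    Component("kafka", "3.6.1", "kubernetes", "fraud-ml"),
]
for comp, findings in non_compliant(inventory):
    print(f"PATCH NEEDED: {comp.name} {comp.version} on {comp.platform} -> {findings}")
```

Run continuously against the live inventory, a check like this is what turns one-off hardening into the ongoing compliance posture the platform needs.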
