Construction, operation and further development of a globally distributed Elastic Stack for company-wide aggregation, analysis and further processing of log files and events.
Brief description:
Our service solution includes the decentralized collection of log data in 15 globally distributed data lakes built on the Elastic Stack. Generic data ingestion pipelines are provided via Logstash so that a large number of heterogeneous systems across the company can deliver their logs, which are then aggregated in dashboards or processed programmatically.
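As a minimal sketch of what delivering logs to such a pipeline involves (the index name and field names below are illustrative assumptions, not the project's actual schema), documents are typically shipped to Elasticsearch as newline-delimited JSON in the format the _bulk endpoint expects:

```python
import json

def to_bulk_ndjson(docs, index="logs-example"):
    """Serialize documents into the NDJSON body expected by the
    Elasticsearch _bulk endpoint: one action line followed by one
    source line per document, terminated by a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = to_bulk_ndjson([{"message": "disk full", "host": "srv01"}])
```

This alternating action/source layout is what allows a broker stage (such as Logstash with persistent queues) to batch many heterogeneous events into a single indexing request.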
Situation:
Incident management plays a central role in globally distributed infrastructures. Enabling fast troubleshooting and reducing downtime are essential. The central aggregation of events and status information enables early problem detection and contributes to the resilience of the entire infrastructure.
Customer request:
- Development, setup and operation of an event and log analysis platform
- Development and testing of Docker containers for operation in Kubernetes
- Linux automation with help scripts in Bash and Python
- Further development of the platform based on cutting-edge best practices
- Reduction of downtime through faster incident response
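A typical Linux automation helper of the kind requested above might, for example, flag log volumes that are running full. The following is a hedged Python sketch (the function name and threshold are illustrative, not the project's actual scripts) that parses POSIX `df -P` output:

```python
def full_volumes(df_output: str, threshold: int = 90):
    """Parse POSIX `df -P` output and return (mount point, usage %)
    tuples for volumes at or above the given usage threshold."""
    volumes = []
    for line in df_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 6:
            continue
        use_pct = int(fields[4].rstrip("%"))
        if use_pct >= threshold:
            volumes.append((fields[5], use_pct))
    return volumes

sample = """Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 100 95 5 95% /var/log
/dev/sdb1 100 10 90 10% /data"""
print(full_volumes(sample))  # → [('/var/log', 95)]
```

A script like this can feed its result into a monitoring event so that a nearly full log partition surfaces in the same dashboards as the application logs themselves.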
Challenges:
- Development and automation of an event and log analysis platform based on Kubernetes for the operation of Elasticsearch, Logstash, Kibana and Logstash Persistent Queues as a broker
- Automation of deployments and tests using Jenkins, Helm and Groovy scripts
- Connection of new data sources to the existing infrastructure (Syslog, Filebeat and REST API sources)
- Development of data processing filters with Logstash
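To illustrate the kind of data processing filter Logstash performs, here is a hedged Python sketch of a grok-style parser for classic BSD syslog lines (RFC 3164). The field names are assumptions modeled on common Logstash conventions, not the project's actual filter configuration:

```python
import re

# Grok-style pattern for a classic BSD syslog line (RFC 3164);
# the named groups mirror fields a Logstash grok filter would emit.
SYSLOG_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2}\s[\d:]{8})\s"
    r"(?P<host>\S+)\s"
    r"(?P<program>[^\[:]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?P<message>.*)$"
)

def parse_syslog(line: str):
    """Turn one raw syslog line into a flat dict, analogous to a
    Logstash filter stage; returns None when the line does not match."""
    m = SYSLOG_RE.match(line)
    return m.groupdict() if m else None

event = parse_syslog("Jan  5 04:12:01 srv01 sshd[991]: Accepted publickey for ops")
```

Normalizing heterogeneous inputs into one flat schema like this is what makes the downstream full-text search and dashboards source-agnostic.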
KPIs:
- The customer has a highly available, scalable platform to which all systems can send their logs, metrics and inventory data in one place; the data can be filtered and visualized via full-text search
- The platform currently runs in 15 regions worldwide and processes up to 40,000 data records per second in each region
- Server management operations are simplified by consolidating all server data and logs from different manufacturers into a single target format that can be clearly displayed and monitored in the dashboards
- With the help of customized data integrations, any data source can be connected to OctoBus
- Based on the server logs and metrics, automatic notifications can be sent via email, Slack or external ticket systems for alerting and further troubleshooting
- White-label service platform on which other services are built
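The alerting KPI can be sketched as a simple threshold-based router; the channel names and thresholds below are illustrative assumptions, not the customer's actual alerting rules:

```python
def route_alerts(error_counts, error_threshold=100):
    """Decide notification channels from aggregated per-host error
    counts, mirroring threshold-based alerting on top of the stack."""
    alerts = []
    for host, count in error_counts.items():
        if count >= error_threshold:
            # High-volume failures escalate to Slack and the ticket system.
            alerts.append({"host": host, "channels": ["slack", "ticket"]})
        elif count > 0:
            # Low-volume failures go into an email digest only.
            alerts.append({"host": host, "channels": ["email"]})
    return alerts

print(route_alerts({"srv01": 150, "srv02": 3, "srv03": 0}))
```

In practice the routing decision would be driven by the aggregated data in Elasticsearch rather than an in-memory dict, but the escalation logic is the same.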
Netlution solution:
A customized, adaptive and fail-safe data lake solution through the use of global Kubernetes clusters.
Project duration:
The service started in 2018 and is ongoing.
Netlution services:
- Ensuring service, administration, optimization and further development of the entire stack
- Fast and uncomplicated onboarding of internal service consumers
- Documentation (incl. knowledge transfer management, e.g. wiki, recordings)
- Managed service (incl. SLAs) in a rolling deployment system (RES) with > 10 Netlution consultants/senior consultants
- Log data storage in the petabyte range
- Onboarding and coaching of internal customer employees