AutoOps: A Multi-Purpose Platform Integrating Project Management, Dev(Sec)Ops, and Metrics Monitoring

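A developer's day-to-day work today falls into several parts. The first is regular feature development. Thanks to the rapid growth and steady refinement of project management practice, this work is driven by Scrum/Agile and runs in sprints: estimate, plan, assign, implement, and review, moving forward cycle after cycle until the project goal is reached. Throughout this process the development workload is quantified with one widely accepted unit, usually the story point (SP). In most cases SP is converted into man-days or hours based on the actual average capacity of the Scrum team. So the number of SP a developer completes in a sprint converts naturally into man-days or hours, and, combined with the time the developer actually spent, gives clear and intuitive data for a workload report.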
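The conversion described above can be sketched roughly as follows. This is only an illustration, not the AutoOps implementation: the `SprintEntry` type, the `hours_per_sp` ratio, and the 8-hour man-day are all assumptions made for the example.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 8  # assumption: one man-day equals 8 working hours


@dataclass
class SprintEntry:
    """Hypothetical record of one developer's sprint, for illustration only."""
    developer: str
    story_points: float  # SP completed in the sprint
    hours_spent: float   # time actually spent, as logged


def estimated_hours(story_points: float, hours_per_sp: float) -> float:
    """Convert story points to hours using the team's agreed ratio."""
    return story_points * hours_per_sp


def workload_row(entry: SprintEntry, hours_per_sp: float) -> dict:
    """Build one workload-report row: estimate, actual, and the difference."""
    est = estimated_hours(entry.story_points, hours_per_sp)
    return {
        "developer": entry.developer,
        "story_points": entry.story_points,
        "estimated_hours": est,
        "estimated_man_days": est / HOURS_PER_DAY,
        "actual_hours": entry.hours_spent,
        "variance_hours": entry.hours_spent - est,  # positive = over estimate
    }


if __name__ == "__main__":
    # Example: a team that treats 1 SP as roughly 4 hours of work.
    print(workload_row(SprintEntry("alice", 10, 45), hours_per_sp=4.0))
```

The ratio is deliberately passed in as a parameter, since each Scrum team calibrates its own value from its actual average capacity.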

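As DevOps/DevSecOps has taken hold, the developer's workload has grown considerably: test, build, and deploy within the development cycle, plus issues, bugs, vulnerabilities, maintenance, operations, and (P-level) incidents on the operations side. This added work is also increasingly hard to quantify. If it cannot be quantified the way feature development is, the workload cannot be reflected accurately and the work cannot be planned into the sprint. Worse, underestimating it can leave too few resources in reserve, which squeezes the feature development schedule.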

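On the metrics monitoring side, beyond day-to-day performance monitoring, developers now own the whole Dev(Sec)Ops cycle, so we ask more of real-time metrics monitoring than before: we want to use the metrics we already have to issue an early warning before a potential P-level incident occurs, and so avoid a serious outage.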
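One simple way to picture this kind of early warning is a baseline check with two thresholds, so that a "warning" fires before the value is bad enough to be an "alert". The sketch below is an assumption about how such a check could look, not a description of how AutoOps computes its baseline; the function names and the sigma thresholds are hypothetical.

```python
from statistics import mean, pstdev


def build_baseline(history: list[float]) -> tuple[float, float]:
    """Summarize historical samples of one metric as (mean, standard deviation)."""
    return mean(history), pstdev(history)


def classify(value: float, baseline: tuple[float, float],
             warn_sigma: float = 2.0, alert_sigma: float = 3.0) -> str:
    """Return 'ok', 'warning', or 'alert' for a new sample.

    The warning band sits below the alert band, so a drifting metric
    raises a warning before it reaches alert level.
    """
    avg, std = baseline
    if std == 0:
        # Flat history: any deviation at all is treated as an alert.
        return "ok" if value == avg else "alert"
    z = (value - avg) / std
    if z >= alert_sigma:
        return "alert"
    if z >= warn_sigma:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical response-time samples in milliseconds.
    baseline = build_baseline([100, 102, 98, 101, 99])
    for sample in (100.5, 103, 105):
        print(sample, classify(sample, baseline))
```

A real baseline would likely account for time of day, seasonality, and per-service differences; the z-score here only illustrates the idea of warning ahead of the alert threshold.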

Likewise, unpredictable operation and monitoring events make it much harder to balance the sprint workload, and for Scrum team members, producing a better and more accurate workload report is a matter of direct personal interest.

We therefore built the AutoOps (AutoCore + AutoShell) system, which provides the following features:

Feature
- Collect ticket info (id, title, summary, type, priority, createdby, created, status, story point, project, …) from Jira
- Collect the time each ticket spends in each status (waiting for support, feature analysis, wait for customer response, in development, in qa, in review, waiting for release)
- Collect ticket re-work count (a dev, qa, review, or release status that appears more than once)
Code
- Collect commit info from Bitbucket (Git) and link it to the Jira ticket
Build & Test & Deploy
- Collect time spent on build, test, Sonar, and package steps from Jenkins
- Collect test report & stats from Jira
- Archive test report from Jira to storage
- Collect time spent on deploy from CloudBees
- Collect deploy re-work from CloudBees
- Link deploy to sprint release
Operation
- Collect bugs & vulnerabilities from SonarQube and auto-create a Jira ticket (type = bug) to follow in the sprint
- Link each bug & vulnerability to its commit (and from there to the ticket)
- Auto-script for maintenance
Monitor
- Collect metrics from our own apps
- Collect metrics from New Relic
- Collect metrics from AWS CloudWatch
- Collect metrics from product database
- Collect metrics from Splunk
- Receive metrics from our own apps
- Aggregate metrics by catalog, service, host, pid
- Calculate performance capacity baseline
- Display dashboard (Kibana, Grafana, custom-drawn)
- Trigger potential warning/alert
- Trigger Operation-Auto-Script to resolve warning/alert automatically
- Publish key service status to Statuspage
- Create Jira ticket (type = issue) for alert
Knowledge Database (CMS)
- Link Jira ticket to Confluence (wiki)
Data Dictionary
- Collect data point metadata from product database
- Collect dataflow from upstream to downstream
- Collect data point-dataflow-app relationship
- Link Data Dictionary to Jira ticket
Report
- Generate sprint-based incremental feature report
- Generate sprint-based incremental test report
- Generate sprint-based incremental bug & vulnerability report
- Generate sprint-based incremental release report
- Generate sprint-based incremental operation report
- Generate sprint-based incremental service report
- Generate sprint-based incremental warn & alert report
- Generate sprint-based incremental knowledge database report
- Generate sprint-based incremental data dictionary report
- Generate project-based management report
- Generate service-based metric report
- Generate service-based support report
- Generate customer-based support report
- Generate team-based workload report
- Generate team-member-based workload report
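As an illustration of the "time in each status" and "re-work count" items in the Feature group above, the sketch below derives both from a ticket's status history. The input shape (a time-ordered list of `(entered_at, status)` pairs) is an assumption made for this example and is not the Jira API format; the set of re-work statuses is taken from the status names listed above.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Statuses whose repeated appearance counts as re-work (dev, qa, review, release).
REWORK_STATUSES = {"in development", "in qa", "in review", "waiting for release"}


def time_per_status(transitions, closed_at):
    """Sum the hours a ticket spent in each status.

    transitions: time-ordered list of (entered_at, status) pairs, one per
                 actual status change (hypothetical shape, see lead-in).
    closed_at:   when the ticket left its final status.
    """
    totals = defaultdict(float)
    ends = transitions[1:] + [(closed_at, None)]
    for (start, status), (end, _) in zip(transitions, ends):
        totals[status] += (end - start).total_seconds() / 3600
    return dict(totals)


def rework_count(transitions):
    """Count how many times a re-work status was re-entered after its first visit."""
    visits = Counter(status for _, status in transitions)
    return sum(n - 1 for status, n in visits.items() if status in REWORK_STATUSES)


if __name__ == "__main__":
    def at(hour):
        return datetime(2024, 1, 1, hour)

    history = [
        (at(9), "in development"),
        (at(12), "in qa"),
        (at(13), "in development"),  # bounced back from QA
        (at(15), "in qa"),
        (at(16), "in review"),
    ]
    print(time_per_status(history, closed_at=at(17)))
    print(rework_count(history))
```

Keeping the two calculations as pure functions over an already-collected history makes them easy to reuse in the sprint-based reports, whatever the collection mechanism is.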

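I jokingly call it the hodgepodge (Misc.) platform, first because it covers almost everything we do at work, and second because it is stitched together from many languages: the core is Python + Django, JavaScript goes without saying, and mixed in are Shell, Batch, Java, C#, PowerShell, Node.js, and Groovy, plus IaC (Infrastructure as Code) with Terraform and AWS Native, and so on. In short, whatever gets the job done.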

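The end results:

- Sprint workload is more accurate, and quality and velocity stay at a high level.
- Report automation means reports are no longer a headache: the various reports are generated on a schedule and sent to each person, who glances over them, confirms nothing is seriously off, and forwards them. Everything is backed by data and dashboards and looks professional at a glance.
- Complete issue, bug, and vulnerability tracing.
- Thorough knowledge database documentation; FAQs generated from hot spots have removed much of the upstream/downstream support workload.
- Expected vs. actual: automatically comparing actual workload against estimated workload gives the retrospective detailed suggestions for improvement.
- Quality, test, and release reports keep feeding the retrospective with suggestions for improving release quality.
- Metrics aggregation + performance baseline + auto-scripts automatically stop many potential P-level incidents at an early stage.
- Combined with AWS Lambda + AWS Auto Scaling groups, we can move beyond traditional CPU + memory + disk monitoring, take actual business demand into account, and scale in/out more precisely.
- The Data Dictionary has become not only our customer-based deliverable report for every sprint, but also a well-known wiki-dict inside the company.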
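To make the business-driven scaling point more concrete, here is a minimal sketch of sizing a group from a business metric (a job backlog) instead of CPU, memory, or disk. The metric, the function name, and all parameters are hypothetical; the post does not say which business metrics AutoOps actually uses.

```python
import math


def desired_capacity(pending_jobs: int,
                     jobs_per_instance_per_min: float,
                     target_drain_minutes: float,
                     min_size: int,
                     max_size: int) -> int:
    """Instances needed to drain the backlog within the target time.

    The result is clamped to [min_size, max_size], mirroring the bounds
    an Auto Scaling group enforces.
    """
    if jobs_per_instance_per_min <= 0 or target_drain_minutes <= 0:
        raise ValueError("throughput and target time must be positive")
    per_instance = jobs_per_instance_per_min * target_drain_minutes
    needed = math.ceil(pending_jobs / per_instance)
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    # 1200 queued jobs, 50 jobs/min per instance, drain within 5 minutes.
    print(desired_capacity(1200, 50, 5, min_size=1, max_size=10))
```

In a setup like the one described, a function of this kind could run inside an AWS Lambda and pass its result to the Auto Scaling group, for example through boto3's `set_desired_capacity` call; that wiring is left out here because it depends on the actual environment.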
