AutoOps – A multi-purpose platform that integrates Project Management, Dev(Sec)Ops, and Metrics Monitor

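A developer's day-to-day work today falls into several parts. The first is regular Feature Development. Thanks to the rapid growth and steady maturing of Project Management, this work runs under Scrum/Agile in the form of Sprints: estimate, plan, assign, implement, and review, moving forward cycle by cycle until the project goal is reached (Project Complete). Throughout this process, development workload is quantified with a single, widely accepted metric, usually the story point (sp). In most cases, sp is converted into man-days or hours (Man * Day or Hour) based on the actual average capacity of the Scrum Team members. So the number of sp a Developer completes in a Sprint converts naturally into ManDay or hr, and together with the time the person actually spent (Time Spent Cost), this data makes a Workload Report clear and easy to read.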
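
The conversion described above can be sketched in a few lines of Python. This is only an illustration: the names `SprintEntry` and `workload_row` are hypothetical, and the rates of 4 hours per story point and 8 hours per man-day are assumed values, not figures from our team.

```python
from dataclasses import dataclass

# Assumed conversion rates for illustration; a real team would derive
# these from its own historical average capacity.
HOURS_PER_STORY_POINT = 4.0
HOURS_PER_MAN_DAY = 8.0


@dataclass
class SprintEntry:
    developer: str
    story_points: float  # sp completed in the sprint
    hours_spent: float   # time actually logged (Time Spent Cost)


def workload_row(entry: SprintEntry) -> dict:
    """Build one Workload Report row: estimate from sp vs. actual time."""
    estimated_hours = entry.story_points * HOURS_PER_STORY_POINT
    return {
        "developer": entry.developer,
        "story_points": entry.story_points,
        "estimated_hours": estimated_hours,
        "estimated_man_days": estimated_hours / HOURS_PER_MAN_DAY,
        "actual_hours": entry.hours_spent,
        # Positive variance means the work took longer than estimated.
        "variance_hours": entry.hours_spent - estimated_hours,
    }


if __name__ == "__main__":
    print(workload_row(SprintEntry("alice", story_points=10, hours_spent=44)))
```

Keeping the estimate and the actual time side by side in one row is what makes the later Expected vs. Actual comparison cheap to produce.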

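As DevOps/DevSecOps has taken hold, the Developer's workload has grown considerably: Test, Build, and Deploy within the Development cycle, plus Issue, Bug, Vulnerability, Maintenance, Operation, and (P-Level) Incident work on the Operation side. This added work is also increasingly hard to quantify. If it cannot be quantified the way Feature Development is, it cannot be reflected accurately in the Workload, and it cannot be planned into the Sprint either. Underestimating it may even leave too little Resource in reserve, which squeezes Feature Development progress.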

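On the Metrics Monitor side, beyond routine Performance Monitor, the fact that the Developer owns the whole Dev(Sec)Ops cycle raises the bar for Real-Time Metrics Monitoring: we want to issue a warning from the Metrics we already have before a Potential P-Level Incident occurs, so that serious incidents are avoided.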
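
One simple way to turn "warn before it becomes an incident" into code is to compare the latest metric value against a baseline built from recent history. The sketch below is an assumption for illustration, not the actual AutoOps logic: the function name `classify`, the z-score approach, and the 2-sigma / 3-sigma thresholds are all hypothetical choices.

```python
from statistics import mean, stdev


def classify(value: float, history: list[float],
             warn_sigma: float = 2.0, alert_sigma: float = 3.0) -> str:
    """Return "ok", "warning", or "alert" for a new metric sample.

    The baseline is the mean and sample standard deviation of `history`;
    the sample is judged by how many standard deviations it sits away
    from that mean.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is suspicious.
        return "ok" if value == mu else "alert"
    z = abs(value - mu) / sigma
    if z >= alert_sigma:
        return "alert"
    if z >= warn_sigma:
        return "warning"
    return "ok"


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99]  # made-up sample data
    for sample in (100, 104, 110):
        print(sample, classify(sample, latency_ms))
```

The "warning" band is the interesting part: it fires while the metric is drifting but before it reaches the level that would normally be treated as an incident.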

Likewise, unpredictable Operation and Monitoring Events make it much harder to Balance Sprint Workload, and for Scrum Team Members, producing a better and more accurate Workload Report is a matter that affects them directly.

So we built a system called AutoOps (AutoCore + AutoShell) that implements the following features:

  • Feature
    • Collect ticket info (id, title, summary, type, priority, createdby, created, status, story point, project, …) from jira
    • Collect ticket time spent cost for each status (waiting for support, feature analysis, wait for customer response, in development, in qa, in review, waiting for release)
    • Collect ticket re-work count (status of dev, qa, review, release appears more than once)
  • Code
    • Collect commit info from bitbucket(git) and link to jira ticket
  • Build & Test & Deploy
    • Collect time spent cost for build, test, sonar, package from jenkins
    • Collect test report & stats from jira
    • Archive test report from jira to storage
    • Collect time spent cost for deploy from cloudbees
    • Collect deploy re-work from cloudbees
    • Link deploy to sprint release
  • Operation
    • Collect bug & vulnerability from sonarqube and auto create jira ticket (type = bug) to follow in sprint
    • Link bug & vulnerability to commit (then ticket)
    • Auto-script for maintenance
  • Monitor
    • Collect metrics from self apps
    • Collect metrics from newrelic
    • Collect metrics from aws cloudwatch
    • Collect metrics from product database
    • Collect metrics from splunk
    • Receive metrics from self apps
    • Aggregate metrics by catalog, service, host, pid
    • Calculate performance capacity baseline
    • Display dashboard (Kibana, Grafana, self-draw)
    • Trigger potential warning/alert
    • Trigger Operation-Auto-Script to resolve warning/alert automatically
    • Publish key service status to statuspage
    • Create jira ticket (type = issue) for alert
  • Knowledge Database (CMS)
    • Link jira ticket to confluence (wiki)
  • Data Dictionary
    • Collect data point metadata from product database
    • Collect dataflow from upstream to downstream
    • Collect data point-dataflow-app relationship
    • Link Data Dictionary to jira ticket
  • Report
    • Generate sprint-based incremental feature report
    • Generate sprint-based incremental test report
    • Generate sprint-based incremental bug & vulnerability report
    • Generate sprint-based incremental release report
    • Generate sprint-based incremental operation report
    • Generate sprint-based incremental service report
    • Generate sprint-based incremental warn & alert report
    • Generate sprint-based incremental knowledge database report
    • Generate sprint-based incremental data dictionary report
    • Generate project-based management report
    • Generate service-based metric report
    • Generate service-based support report
    • Generate customer-based support report
    • Generate team-based workload report
    • Generate team-member-based workload report
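
As an illustration of the "time spent cost for each status" and "re-work count" items in the Feature group, the sketch below works on a ticket's status history once it has been collected. It does not call jira. The input shape (a time-ordered list of `(timestamp, status)` pairs), the function names, and the rule that every repeat visit to a tracked status counts as one re-work are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime

# Statuses whose repeated appearance counts as re-work (assumed mapping of
# "dev, qa, review, release" onto the status names listed above).
REWORK_STATUSES = {"in development", "in qa", "in review", "waiting for release"}


def time_per_status(transitions, closed_at):
    """Sum the hours spent in each status.

    `transitions` is a time-ordered list of (entered_at, status) pairs;
    `closed_at` marks the end of the last status.
    """
    totals = defaultdict(float)
    ends = [t for t, _ in transitions[1:]] + [closed_at]
    for (start, status), end in zip(transitions, ends):
        totals[status] += (end - start).total_seconds() / 3600
    return dict(totals)


def rework_count(transitions):
    """Count repeat visits to dev / qa / review / release statuses."""
    visits = defaultdict(int)
    for _, status in transitions:
        if status in REWORK_STATUSES:
            visits[status] += 1
    return sum(n - 1 for n in visits.values() if n > 1)


if __name__ == "__main__":
    day = lambda hour: datetime(2024, 1, 1, hour)  # made-up sample data
    history = [
        (day(9), "in development"),
        (day(13), "in qa"),
        (day(15), "in development"),  # bounced back from QA
        (day(18), "in qa"),
        (day(19), "in review"),
    ]
    print(time_per_status(history, closed_at=day(20)))
    print(rework_count(history))
```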

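I jokingly call it the "Misc." platform: first because it covers almost everything we do at work, and second because it is stitched together from many languages. The core is Python + Django, JavaScript goes without saying, and mixed in are Shell, Batch, Java, C#, PowerShell, Node.js, and Groovy, plus IaC (Infrastructure as Code) with Terraform and AWS Native, and so on. In short, we used whatever it took to get the job done.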

The end results:

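  • Sprint Workload is estimated more accurately, and both quality and velocity stay at a high level
  • Report Automation means Reports are no longer a headache: the various Reports are generated on schedule and sent to each person automatically, and after a quick look to confirm nothing is seriously off we simply Forward them. Everything is backed by Data + Dashboard and looks professional at a glance
  • Complete Issue, Bug, and Vulnerability Tracing
  • Well-maintained Knowledge Database Documentation; FAQs generated from HotSpots have removed a lot of Upstream/Downstream Support Workload
  • Expected vs. Actual: automatically comparing actual Workload against estimated Workload gives the Retrospective detailed suggestions for improvement
  • Quality & Test & Release Reports keep feeding the Retrospective with suggestions for improving Release Quality
  • Metrics Aggregation + Performance Baseline + Auto-Script automatically stops many Potential P-Level Incidents at an early stage
  • Combined with AWS Lambda + AWS AutoScalingGroup, we can move beyond traditional CPU + Memory + Disk Monitor and take actual Business demand into account, so Scale In/Out is more precise
  • Data Dictionary has become not only the customer-based deliverable Report of every Sprint, but also a well-known wiki-dict inside the company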
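
To make the business-driven Scale In/Out idea concrete, here is a minimal sketch of sizing a group from a business metric instead of CPU or memory. Everything in it is an assumption for illustration: the metric (pending jobs), the per-instance throughput, the size limits, and the function name `desired_capacity`. It only computes a target size; applying that size to an Auto Scaling group is left out.

```python
import math


def desired_capacity(pending_jobs: int, jobs_per_instance: int,
                     min_size: int = 1, max_size: int = 10) -> int:
    """Return how many instances are needed to work off the backlog.

    The raw need is the backlog divided by what one instance can handle,
    rounded up, then clamped to the group's allowed size range.
    """
    if jobs_per_instance <= 0:
        raise ValueError("jobs_per_instance must be positive")
    needed = math.ceil(pending_jobs / jobs_per_instance)
    return max(min_size, min(max_size, needed))


if __name__ == "__main__":
    for backlog in (0, 120, 10_000):  # made-up sample backlogs
        print(backlog, desired_capacity(backlog, jobs_per_instance=50))
```

Clamping to a minimum and maximum keeps a noisy business metric from scaling the group to zero or to an unbounded size.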
