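A developer's day-to-day work currently falls into several parts. The first is regular feature development. Thanks to the rapid growth and steady refinement of project management practice, Agile/Scrum teams work in Sprints: estimate, plan, assign, implement, review, and repeat, moving forward cycle by cycle until the project is complete. Throughout this process, development effort is quantified with one widely accepted unit, usually the story point (sp). In most teams, sp is then converted into man-days or hours based on the Scrum team members' actual average capacity. So the sp a developer completes in a Sprint converts naturally into man-days or hours, and combined with the time that person actually spent (time spent cost), this data makes a workload report clear and intuitive.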
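The sp-to-time conversion described above can be sketched roughly as follows. This is only an illustration: the 8-hour working day, the `hours_per_sp` rate, and all names (`SprintRecord`, `workload_row`) are assumptions made for the example, not part of our actual system.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 8  # assumption: one man-day is 8 working hours


@dataclass
class SprintRecord:
    developer: str
    story_points: float  # sp completed in the Sprint
    hours_spent: float   # time the developer actually logged


def workload_row(rec: SprintRecord, hours_per_sp: float) -> dict:
    """Convert completed sp into estimated hours and man-days.

    hours_per_sp is the team's average conversion rate (hypothetical value,
    each team would calibrate its own).
    """
    estimated_hours = rec.story_points * hours_per_sp
    return {
        "developer": rec.developer,
        "story_points": rec.story_points,
        "estimated_hours": estimated_hours,
        "estimated_man_days": estimated_hours / HOURS_PER_DAY,
        "hours_spent": rec.hours_spent,
    }


row = workload_row(SprintRecord("alice", 10, 45), hours_per_sp=4)
print(row)
```

Putting the estimated and the actually spent hours side by side in one row is what makes the later workload report easy to read.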
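As DevOps/DevSecOps has taken hold, developers' work has naturally grown a lot: test, build, and deploy within the development cycle, and on the operations side issues, bugs, vulnerabilities, maintenance, operations, (P-level) incidents, and so on. This added work is also increasingly hard to quantify. If it cannot be quantified the way feature development is, it neither reflects the workload accurately nor fits into Sprint planning, and underestimating it may leave too few resources in reserve, which squeezes feature development progress.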
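On the metrics monitoring side, beyond day-to-day performance monitoring, developers now own the entire Dev(Sec)Ops cycle, so we ask more of the real-time metrics monitoring we used to have: we want it to raise an early warning from the metrics currently at hand before a potential P-level incident occurs, so that serious incidents are avoided.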
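One simple way to picture this kind of early warning is to compare the latest metric value against a baseline built from recent history. The sketch below uses mean plus a multiple of the standard deviation; that rule, the thresholds, and the function name `classify` are assumptions for illustration, not necessarily how our system computes its baseline.

```python
from statistics import mean, pstdev


def classify(history, latest, warn_sigma=2.0, alert_sigma=3.0):
    """Return "ok", "warning" or "alert" for the latest metric value.

    history: recent values of one metric (e.g. response time in ms).
    warn_sigma / alert_sigma: hypothetical thresholds, in standard deviations.
    """
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Flat history: any increase is treated as abnormal.
        return "ok" if latest <= baseline else "alert"
    deviation = (latest - baseline) / spread
    if deviation >= alert_sigma:
        return "alert"
    if deviation >= warn_sigma:
        return "warning"
    return "ok"


history = [100, 102, 98, 101, 99]
for value in (101, 103, 105):
    print(value, classify(history, value))
```

The "warning" level is the interesting one: it fires while the metric is drifting away from its baseline, which leaves time to react before the situation turns into an incident.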
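Likewise, adding unpredictable operation and monitoring events makes balancing the Sprint workload considerably harder, and for Scrum team members, producing a better and more accurate workload report is a topic that directly affects their own interests.
So we built a system called AutoOps (AutoCore + AutoShell) that provides the following features: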
- Feature
  - Collect ticket info (id, title, summary, type, priority, createdby, created, status, story point, project, …) from Jira
  - Collect ticket time spent cost for each status (waiting for support, feature analysis, wait for customer response, in development, in qa, in review, waiting for release)
  - Collect ticket re-work count (status of dev, qa, review, release appears more than once)
- Code
  - Collect commit info from Bitbucket (git) and link to Jira ticket
- Build & Test & Deploy
  - Collect time spent cost for build, test, sonar, package from Jenkins
  - Collect test report & stats from Jira
  - Archive test report from Jira to storage
  - Collect time spent cost for deploy from CloudBees
  - Collect deploy re-work from CloudBees
  - Link deploy to sprint release
- Operation
  - Collect bug & vulnerability from SonarQube and auto create Jira ticket (type = bug) to follow in sprint
  - Link bug & vulnerability to commit (then ticket)
  - Auto-script for maintenance
- Monitor
  - Collect metrics from in-house apps
  - Collect metrics from New Relic
  - Collect metrics from AWS CloudWatch
  - Collect metrics from product database
  - Collect metrics from Splunk
  - Receive metrics from in-house apps
  - Aggregate metrics by catalog, service, host, pid
  - Calculate performance capacity baseline
  - Display dashboard (Kibana, Grafana, custom-built)
  - Trigger potential warning/alert
  - Trigger Operation-Auto-Script to resolve warning/alert automatically
  - Publish key service status to Statuspage
  - Create Jira ticket (type = issue) for alert
- Knowledge Database (CMS)
  - Link Jira ticket to Confluence (wiki)
- Data Dictionary
  - Collect data point metadata from product database
  - Collect dataflow from upstream to downstream
  - Collect data point-dataflow-app relationship
  - Link Data Dictionary to Jira ticket
- Report
  - Generate sprint-based incremental feature report
  - Generate sprint-based incremental test report
  - Generate sprint-based incremental bug & vulnerability report
  - Generate sprint-based incremental release report
  - Generate sprint-based incremental operation report
  - Generate sprint-based incremental service report
  - Generate sprint-based incremental warn & alert report
  - Generate sprint-based incremental knowledge database report
  - Generate sprint-based incremental data dictionary report
  - Generate project-based management report
  - Generate service-based metric report
  - Generate service-based support report
  - Generate customer-based support report
  - Generate team-based workload report
  - Generate team-member-based workload report
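To make the "time spent cost for each status" and "re-work count" items in the Feature group more concrete, here is a minimal sketch that derives both from a ticket's status history. The input shape (a time-ordered list of `(status, entered_at)` pairs), the function name `summarize`, and the rule that every repeat visit to a dev/qa/review/release status counts as one re-work are all assumptions for illustration; the real data would come from the ticket's change history in Jira.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Statuses whose repeated appearance counts as re-work (names from the list above).
REWORK_STATUSES = {"in development", "in qa", "in review", "waiting for release"}


def summarize(transitions, closed_at):
    """Return (hours spent per status, re-work count) for one ticket.

    transitions: list of (status, entered_at) tuples, sorted by time.
    closed_at: the moment the ticket left its last status.
    """
    hours = defaultdict(float)
    visits = Counter()
    # Each status ends when the next one begins; the last ends at closed_at.
    left_at = [entered for _, entered in transitions[1:]] + [closed_at]
    for (status, entered), left in zip(transitions, left_at):
        hours[status] += (left - entered).total_seconds() / 3600
        visits[status] += 1
    # Assumption: every visit after the first one is one re-work.
    rework = sum(visits[s] - 1 for s in REWORK_STATUSES if visits[s] > 1)
    return dict(hours), rework


day = lambda h: datetime(2024, 1, 8, h)  # hypothetical sample date
history = [
    ("in development", day(9)),
    ("in qa", day(13)),
    ("in development", day(15)),  # bounced back from QA
    ("in qa", day(16)),
    ("in review", day(17)),
]
print(summarize(history, closed_at=day(18)))
```

Summing these per-status hours across a Sprint's tickets gives the raw material for the workload reports in the Report group.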
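I jokingly call it the odds-and-ends (Misc.) platform, first because it covers nearly everything in our work, and second because it is stitched together from many languages: mainly Python + Django, JavaScript goes without saying, mixed with Shell, Batch, Java, C#, PowerShell, Node.js, and Groovy, plus IaC (Infrastructure as Code) with Terraform and AWS Native, and so on. In short, whatever gets the job done.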
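The end results:
- Sprint workload is more accurate, and quality and velocity stay at a high level
- Report automation means reports are no longer a headache: the various reports are generated on a schedule and sent to each person automatically, who glances over them, confirms nothing is seriously off, and simply forwards them. Everything speaks through data + dashboards and looks professional at a glance
- Complete issue, bug, and vulnerability tracing
- Thorough Knowledge Database documentation; FAQs generated from hot spots have cut a lot of upstream/downstream support workload
- Expected vs. Actual: automatically comparing actual workload against estimated workload gives the Retrospective detailed suggestions for improvement
- Quality, test, and release reports keep feeding the Retrospective suggestions for improving release quality
- Metrics aggregation + performance baseline + auto-scripts automatically stop many potential P-level incidents at an early stage
- Combined with AWS Lambda + AWS AutoScalingGroup, we can move beyond traditional CPU + memory + disk monitoring and scale in/out more precisely based on actual business demand
- The Data Dictionary has become not only the customer-based deliverable report of every Sprint but also a well-known wiki-dict inside the company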
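As an illustration of scaling on business demand rather than on CPU, memory, or disk, the sketch below derives a target instance count from a business metric. The metric (pending orders), the per-instance throughput, and the function name `desired_capacity` are hypothetical; in practice the result would be applied as the desired capacity of the scaling group.

```python
import math


def desired_capacity(pending_orders, orders_per_instance, min_size, max_size):
    """Pick an instance count from a business metric.

    pending_orders: current backlog (hypothetical business metric).
    orders_per_instance: how many orders one instance can handle
        (assumed to be known from the performance baseline).
    min_size / max_size: bounds of the scaling group.
    """
    if orders_per_instance <= 0:
        raise ValueError("orders_per_instance must be positive")
    needed = math.ceil(pending_orders / orders_per_instance)
    # Clamp to the group's bounds so we never scale below min or above max.
    return max(min_size, min(max_size, needed))


for backlog in (0, 950, 5000):
    print(backlog, desired_capacity(backlog, 100, min_size=2, max_size=20))
```

The point of the design is that the input is something the business actually cares about, so a quiet CPU with a growing backlog still triggers a scale-out.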


