Job Description
* Engage with the development team throughout the application life cycle to ensure applications are built with stability in mind
* Engage at the Fleet and Squad level to promote SRE principles: adoption of SLIs and SLOs, stability reviews to identify areas for improvement, and guidance on the architecture of monitoring/alerting as part of business solution design
* Coordinate with the Dev, SRE and Automation/Instrumentation teams to improve legacy monitoring and alerting workflows
* Interface with the L1/Command Center team to review and improve alert escalation and level 1 triage of events
* Participate in and coordinate resiliency testing using Chaos Engineering principles
* Perform monthly stability analysis to identify areas for improvement and reduce overall minor (S5) incidents, manual data updates, manual reporting, and repetitive tasks
* Interface with the Business and Operations teams to plan for business activities, communicate the status of production issues and assess the business impact of any production event
* Triage incidents across multiple environments and workflows, providing technical analysis and action steps to resolve issues
* Conduct blameless post-mortems and ensure permanent closure of incidents
* Engage with the wider Reliability Operations, WM and MS Site Reliability Engineering teams to stay informed of current initiatives, participate in working groups and improve participation of the IST ASG team in SRE initiatives
Qualifications
Required Skills:
* 5+ years of experience in enterprise software and proficiency in multiple languages, preferably Java, Python, .NET, COBOL and shell scripting
* 3+ years of experience supporting or developing applications for order processing, trade execution or portfolio management
* 2+ years of incident resolution experience in a large-scale operations environment spanning both mainframe and distributed environments
* 3+ years of experience with or knowledge of distributed web hosting services, databases and MQ processing, e.g. Tomcat, WebSphere, Microsoft IIS, Db2, MSSQL
* Working knowledge of FIX messaging protocol
* Experience working in an Agile Development environment
* Hands-on knowledge or certification in Site Reliability Engineering
* Proven ability to understand and troubleshoot complex problems under pressure
* Good working knowledge of Cloud Engineering. Understanding of private cloud principles and exposure to public cloud offerings such as AWS, Azure or similar technology is preferred
* Experience in performance engineering and monitoring using tools such as AppDynamics, Splunk, Apica, JMeter, Grafana or Prometheus
* Mainframe experience, including the general mainframe environment, TWS scheduling and MQ processing
Nice to have:
* Bachelor's degree (or equivalent experience) in Computer Science/Engineering
* Excellent written and verbal communication skills