Senior Site Reliability Engineer, Principal @ Altais - Oakland, CA

Job Overview

4 months ago

About Our Company
At Altais, we're looking for bold and curious innovators who share our passion for enabling better health care experiences and revolutionizing the healthcare system for physicians, patients, and the clinical community. Doctors today are faced with the reality of spending more time on administrative tasks than caring for patients. Physician burnout and fatigue are an epidemic, and the healthcare experience and quality suffer as a result. At Altais, we’re building breakthrough clinical support tools, technology, and services to let doctors do what they do best: care for people. Come join us as an early member of our passionate and growing team as we change the game for the future of healthcare and enable the experience that people need and deserve.

About Your Team
Do you enjoy working with a highly motivated and talented team to deliver mission-critical healthcare solutions that change the way healthcare is delivered? Altais is growing our Site Reliability Engineering team to help deploy, manage, troubleshoot, and enhance our complex cloud-based services for our customers.
Do you want to push the limits of Amazon Web Services to drive value-based care, using Athena, EMR, Kinesis, Redshift, Glue, MQ, Neptune, Greengrass, SageMaker, Kendra, Lex, Textract, Transcribe, Polly, and Macie, along with PyTorch and TensorFlow?
We are looking for a highly technical, hands-on Engineer with experience using several open-source projects commonly found in large-scale deployments. You will be managing our Kubernetes lifecycle: deployments, upgrades, monitoring, and uptime of all Kubernetes clusters. You will help to advance the deployment process of software into Kubernetes with GitLab at massive scale. Additionally, you will work towards perfecting the metrics and alerting from Datadog and PagerDuty so that all events are actionable.
Your focus will be on maximizing system uptime. Team members all participate in an on-call rotation.
You will build innovative automated solutions and tools to help debug and resolve problems in production and prevent them from recurring. Further, you will proactively seek out system weaknesses and find ways to fix them before they cause production issues by using monitoring data, watching trends, and applying Chaos Engineering.
This position is located in our brand-new Oakland City Center location.

About Your Work
  • Keeping your assigned site or service up and running or getting it back up and running quickly when a failure occurs.
  • Automating work including infrastructure needs, testing, fail-over mitigation, and much more.
  • Developing CI/CD processes to improve cadence.
  • Working closely with internal partners and teams to ensure that we ship software that meets security, SLA, and performance requirements.
  • Debugging complex problems across an entire stack and creating solid solutions.
  • Running post-incident reviews to find out what’s working and what’s not, and filling the gaps in the process.
  • Writing and updating user documentation, including runbooks/playbooks.
  • Using Chaos Engineering to test what you build under real-world conditions.
  • Running monthly Chaos Engineering “Game Days”.

The Skills, Experience & Education You Bring
  • 10 years of experience with software engineering, software development, or system operations
  • Experience designing, building, and operating large-scale production Software-as-a-Service platforms
  • Experience with monitoring and observability tools such as Datadog and Prometheus
  • Production experience with DevOps or site reliability engineering running web and/or mobile applications
  • Excellent communication skills, both verbal and written
  • Advanced experience with Terraform (Optional: CloudFormation)
  • Hands-on experience with the AWS cloud platform (Optional: GCP or Azure)
  • Experience debugging complex problems, including applications running on Kubernetes and EC2 instances
  • Knows their way around a Unix/Linux shell, can write shell scripts, and understands Linux internals
  • A solid understanding of Node.js and Java
  • Moderate understanding of how databases work, including writing queries to interact with databases and troubleshooting complex data layers, particularly with open-source databases (MySQL, Postgres, Redis, Cassandra, etc.)
  • A solid understanding of networking and core Internet protocols (e.g. TCP/IP, DNS, SMTP, HTTP, and distributed networks)
  • Understands networking and messaging, especially between services
  • Has hands-on experience using source control (Git, GitHub, GitLab) and feature branching strategies
  • Has a track record of embedding security into the fabric of an organization and infrastructure

You Share our Mission & Values
  • You are passionate about improving the healthcare experience and want to be part of the Altais mission.
  • You are bold and curious: willing to take risks, try new things, and be creative.
  • You take pride in your work and are accountable for the quality of everything you do, holding yourself and others to a high standard.
  • You are compassionate and are known as someone who demonstrates emotional intelligence, considers others when making decisions and always tries to do the right thing.
  • You co-create, knowing that we can be better as a team than individuals. You work well with others, collaborating and valuing diversity of thought and perspective.
  • You build trust with your colleagues and customers by demonstrating that you are someone who values honesty and transparency.

Additional Information

  • About Company: Looking for a chance to do meaningful work that touches millions? Come join the hardest working, nonprofit health plan in California and help us shape the future of health care. Altais is focused on transforming health care by making it more accessible, affordable and customer-centric. Being a mission-driven organization means we do much more than serve our 3.5 million members: we were the first health plan in the nation to limit our annual net income to 2 percent of revenue and return the difference to our customers and the community, and since 2005 we have contributed more than million to the Altais Foundation to improve community health and end domestic violence. We also believe that a healthier California begins with our employees, so we provide them with resources to develop and maintain a healthy lifestyle through our award-winning wellness program, Wellvolution. We're hiring smart thinkers and doers who want to work for a leader and innovator in the challenging, ever-changing healthcare space. Come and help us make health care better for everyone.
  • Physical Requirements:
    Office Environment - roles involving part- to full-time schedule in Office Environment. Due to the current public health emergency in California, Blue Shield employees are almost all working remotely. Based in our physical offices and work from home office/deskwork - Activity level: Sedentary, frequency most of work day.
  • EEO Footer:
    External hires must pass a background check/drug screen. Qualified applicants with arrest records and/or conviction records will be considered for employment in a manner consistent with Federal, State and local laws, including but not limited to the San Francisco Fair Chance Ordinance. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, protected veteran status or disability status and any other classification protected by Federal, State and local laws.
    COVID-19 update: From the earliest days of the pandemic, Altais, Blue Shield of California's subsidiary, has been unyielding in our commitment to putting the health and safety of our people first. As a federal contractor and a health care company, Altais requires all employees to be fully vaccinated prior to start date as a condition of employment and provide proof of vaccination status. Altais will consider requests for medical or religious accommodation to this vaccination requirement prior to your start date.
    The definition of 'fully vaccinated' is 14 days following the final dose of a COVID-19 vaccine. If you are unable to be fully vaccinated by your start date, your start date will need to be postponed and you will have 30 days to remedy. If you cannot fulfill the requirement nor obtain an accommodation within 30 days, your offer will be rescinded.
  • Posting Date: Jan 25, 2022
  • Schedule: Full-time
