Qualys - Raleigh, NC

posted 3 months ago

Full-time
Raleigh, NC
Professional, Scientific, and Technical Services

About the position

As a Site Reliability Engineer for the Cloud Platform, you will play a crucial role in the full lifecycle development of cloud platform services, from inception and design through deployment, operation, and continuous improvement. The work is performed in FedRAMP environments, which requires that you be a U.S. Person, a category that includes U.S. citizens, nationals, lawful permanent residents, asylees, and refugees. You may also be required to perform work that is restricted to U.S. citizens on U.S. soil.

In this position, you will focus on increasing the effectiveness, reliability, and performance of cloud platform technologies. This will involve identifying and measuring key performance indicators, making automated changes to production systems, and evaluating the results of those changes. You will support the cloud platform team by engaging in system design, capacity planning, and automation of key deployments. Additionally, you will help build a strategy for production monitoring and alerting, and participate in the testing and verification processes.

Your responsibilities will also include ensuring that cloud platform technologies are properly maintained by monitoring availability, latency, performance, and overall system health. You will advise the cloud platform team on improving system reliability and scaling based on demand. As part of the development process, you will support new features, services, and releases, taking ownership of the cloud platform technologies. You will develop tools and automate processes for large-scale provisioning and deployment of these technologies.

Participation in an on-call rotation is expected, where you will lead incident response efforts and contribute to writing detailed postmortem analysis reports that are candid and constructive. You will also propose improvements and drive efficiencies in systems and processes related to capacity planning, configuration management, scaling services, performance tuning, monitoring, alerting, and root cause analysis.

Responsibilities

  • Co-develop and participate in the full lifecycle development of cloud platform services.
  • Increase the effectiveness, reliability, and performance of cloud platform technologies by identifying and measuring key indicators.
  • Support the cloud platform team before technologies are released to production through system design, capacity planning, and automation of key deployments.
  • Ensure proper maintenance of cloud platform technologies by measuring and monitoring availability, latency, performance, and system health.
  • Advise the cloud platform team to improve the reliability of systems in production and scale them based on need.
  • Participate in the development process by supporting new features, services, and releases, holding an ownership mindset for cloud platform technologies.
  • Develop tools and automate processes for large-scale provisioning and deployment of cloud platform technologies.
  • Participate in the on-call rotation for cloud platform technologies, leading incident response and writing detailed postmortem analysis reports.
  • Propose improvements and drive efficiencies in systems and processes related to capacity planning, configuration management, scaling services, performance tuning, monitoring, alerting, and root cause analysis.

Requirements

  • 4+ years of relevant experience in running distributed systems at scale in production.
  • Expertise in at least one of the following programming languages: Java, Python, or Go.
  • Proficient in writing Bash scripts.
  • Good understanding of SQL and NoSQL systems.
  • Good understanding of systems programming (network stack, file system, OS services).
  • Understanding of network elements such as firewalls, load balancers, DNS, NAT, TLS/SSL, VLANs, etc.
  • Skilled in identifying performance bottlenecks, identifying anomalous system behavior, and determining the root cause of incidents.
  • Knowledge of JVM concepts like garbage collection, heap, stack, profiling, class loading, etc.
  • Knowledge of best practices related to security, performance, high-availability, and disaster recovery.
  • Proven record of handling production issues, planning escalation procedures, and conducting postmortems, impact analyses, risk assessments, and related procedures.
  • Able to drive results and set priorities independently.
  • BS/MS degree in Computer Science, Applied Math, or related field.

Nice-to-haves

  • Experience managing large-scale deployments of search engines such as Elasticsearch.
  • Experience managing large-scale deployments of message-oriented middleware such as Kafka.
  • Experience managing large-scale deployments of RDBMS systems such as Oracle.
  • Experience managing large-scale deployments of NoSQL databases such as Cassandra.
  • Experience managing large-scale deployments of in-memory caches such as Redis and Memcached.
  • Experience with container and orchestration technologies such as Docker and Kubernetes.
  • Experience with monitoring tools such as Graphite, Grafana, and Prometheus.
  • Experience with HashiCorp technologies such as Consul, Vault, Terraform, and Vagrant.
  • Experience with configuration management tools such as Chef, Puppet, or Ansible.
  • In-depth experience with continuous integration and continuous deployment pipelines.
  • Exposure to Maven, Ant, or Gradle for builds.

Benefits

  • Equal Opportunity Employer
  • Commitment to building an environment characterized by respect for the individual
  • Reasonable accommodations for qualified individuals with physical or mental disabilities