Site Reliability Engineer (SRE)
Company Overview
Cantor Fitzgerald is a leading global financial services firm specializing in investment banking, capital markets, institutional equity and fixed income sales and trading, commercial real estate, and prime brokerage. With a legacy of over 75 years of financial innovation and integrity, Cantor operates across major financial centres worldwide, delivering excellence and trusted expertise to its clients.
About the Role
We are seeking a skilled and proactive Site Reliability Engineer to join our Messaging team, which is responsible for the stability, performance, and scalability of enterprise messaging platforms built on Solace PubSub+ software and appliances.
This role focuses on maintaining highly available, low‑latency messaging infrastructure supporting mission‑critical systems across both production and non‑production environments. The successful candidate will play a key role in operational reliability, observability, capacity planning, and continuous improvement, while also gaining exposure to proprietary messaging APIs and platforms.
Key Responsibilities
Administer, maintain, and support Solace PubSub+ appliances and software brokers across on‑premises and cloud environments
Provide production support for messaging‑related incidents, including root cause analysis and permanent remediation
Monitor system performance and availability using Prometheus, InfluxDB, and Grafana, proactively identifying and resolving issues
Configure, optimise, and support Solace deployments across WAN environments, ensuring secure, low‑latency message delivery
Collaborate closely with development, application support, and infrastructure teams to troubleshoot message flow and integration issues
Own capacity planning, scaling, and performance tuning of the messaging platform
Automate routine operational tasks and contribute to continuous improvement of reliability processes
Build and maintain monitoring dashboards, alerts, and metrics to provide deep visibility into messaging systems
Produce and maintain high‑quality documentation, including runbooks, topology diagrams, and configuration baselines
Support proprietary messaging APIs and components using C++, Java, Python, and C#
Provide support for proprietary caches and gateways integrating applications with the messaging layer
Skills & Experience Required
At least 3 years of hands‑on experience administering Solace PubSub+ messaging systems in an enterprise environment
Strong background in production support, ideally within a 24x7 or high‑availability environment
Solid understanding of distributed systems, WAN networking, latency management, and failover strategies
Proven experience with Prometheus and Grafana for monitoring and alerting
Strong troubleshooting skills related to message delivery, persistence, and topic routing
Experience with capacity management, performance tuning, and scalability of distributed platforms
Good knowledge of Linux/Unix operating systems
Scripting and automation skills using Bash and/or Python
Excellent analytical and problem‑solving skills with strong attention to detail
Clear and effective communicator, comfortable working with multiple technical teams
Desirable Skills & Experience
Experience with containerisation technologies such as Docker and Kubernetes
Familiarity with other messaging platforms (Kafka, RabbitMQ, IBM MQ)
Exposure to DevOps practices and CI/CD pipelines
Experience with cloud platforms such as AWS, Azure, or GCP, including cloud‑native Solace deployments
Personal Attributes
Highly motivated, proactive, and ownership‑driven
Comfortable working in a high‑availability, mission‑critical environment
Strong collaborator who works well across teams
Methodical, organised, and capable of handling multiple priorities
Curious and eager to learn new systems and technologies
Calm and effective under pressure
Why Join Us?
Work on low‑latency, high‑throughput messaging systems supporting mission‑critical trading and enterprise platforms
Join a highly skilled, multi‑disciplinary engineering team
Opportunity to work with a broad and modern technology stack
Further develop both infrastructure reliability and programming skills in a complex environment