Your One-Stop Partner in Software Re-Architecture - Cybinity
Background
A leading enterprise approached us with a critical challenge: "Our legacy job scheduling system has become a constant bottleneck for operations and is holding back our innovation."
Reality
Designed years ago, the scheduler had grown increasingly unreliable, resource-hungry and difficult to manage. The issues were severe:
- Frequent memory exhaustion crashed the system.
- Jobs stuck in the queue required frequent manual resets and restarts.
- Missed schedules impacted time-sensitive operations.
- Overworked engineering teams were constantly firefighting instead of focusing on innovation.
Overall, the scheduler had become a serious liability for the team.
The Client's Ask
"Help us build a robust, modern scheduler that is reliable and recovers gracefully from failures."
The Problems That Needed to Be Addressed
- Jobs were often lost after crashes, and there were no retries: the system was unreliable.
- Engineers had to restart the scheduler and retry failed jobs by hand: the system depended heavily on manual work and lacked automation.
- The system could not recover gracefully from memory issues: it lacked fault tolerance.
- The scheduler could not adapt to fluctuating workloads: it lacked dynamic scaling.
Cybinity's Solution
We designed a scheduler that is reliable, scalable and self-healing. The new architecture is modular, distributed and built for resilience.
- Clients schedule jobs via REST API endpoints.
- Jobs are stored in the Jobs Database, with all timestamps in UTC, which eliminates time-zone complexities.
- The Poller continuously queries for jobs scheduled within a defined time window and groups them into batches for efficient handling. It implements a look-ahead window strategy, polling upcoming jobs proactively so they start on time without delays (a sketch of this follows the list).
- The Dispatcher moves jobs to the execution queue (a message broker) at the exact scheduled time.
- Execution Workers pick up job messages from the execution queue and run the job handlers (external services or code). Workers manage retries, failure handling and idempotency.
- The Execution Queue is a durable queue that ensures no jobs are lost: it persists each job until a worker acknowledges execution.
- The Retry Queue drives automatic retries for recoverable failures; irrecoverable jobs are moved to the Dead Letter Queue for manual inspection.
- The Worker Manager monitors worker capacity in real time and dynamically adds or removes worker nodes based on demand, optimizing resource usage and therefore cost.
- The Reconciliation Job is a periodic process that ensures no job is left behind: it picks up pending jobs missed due to outages and terminates long-running "stuck" jobs.
- The Replay API lets clients replay jobs in bulk or reschedule jobs within a date/time range, saving operations teams countless hours of reprocessing work.
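To make the Poller and Dispatcher interplay concrete, here is a minimal Python sketch of one poll cycle. Everything in it is illustrative rather than the production implementation: the in-memory store stands in for the Jobs Database, the 30-second look-ahead window and batch size are assumed values, and a real dispatcher would hand jobs to the message broker instead of calling a function.

```python
import heapq
import time
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

LOOKAHEAD = timedelta(seconds=30)   # assumed look-ahead window
BATCH_SIZE = 100                    # assumed max jobs per poll cycle

@dataclass(order=True)
class Job:
    scheduled_at: datetime               # always UTC, as in the design
    payload: str = field(compare=False)  # jobs are ordered by time only

class InMemoryJobStore:
    """Stand-in for the Jobs Database: a min-heap ordered by schedule time."""
    def __init__(self):
        self._pending = []

    def add(self, job):
        heapq.heappush(self._pending, job)

    def fetch_pending(self, until, limit):
        """Pop up to `limit` jobs scheduled at or before `until`."""
        batch = []
        while self._pending and len(batch) < limit:
            if self._pending[0].scheduled_at > until:
                break
            batch.append(heapq.heappop(self._pending))
        return batch

def poll_and_dispatch(store, dispatch):
    """One poll cycle: fetch jobs due inside the look-ahead window and
    release each at its exact scheduled time."""
    window_end = datetime.now(timezone.utc) + LOOKAHEAD
    for job in store.fetch_pending(until=window_end, limit=BATCH_SIZE):
        wait = (job.scheduled_at - datetime.now(timezone.utc)).total_seconds()
        time.sleep(max(0.0, wait))  # a real dispatcher would not block; it
        dispatch(job)               # would use the broker's delayed delivery

# Usage: schedule two jobs a couple of seconds out, then run one cycle.
store = InMemoryJobStore()
now = datetime.now(timezone.utc)
store.add(Job(now + timedelta(seconds=1), "send-report"))
store.add(Job(now + timedelta(seconds=2), "sync-inventory"))
poll_and_dispatch(store, lambda job: print("dispatched:", job.payload))
```

Polling ahead of time rather than exactly at the due time is what keeps jobs from starting late: the batch is already in hand before the first job's deadline arrives.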
Optimizations
At-least-once delivery with idempotency: every job handler enforces idempotency using a dedupe key. With this approach, clients can resubmit jobs safely; duplicates are detected and ignored, as the sketch below illustrates.
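A minimal sketch of the dedupe-key idea, with an in-memory set standing in for the persistent dedupe store. In production this would typically be a unique database constraint or an atomic set-if-absent in a cache, and the key format shown is an assumption:

```python
processed = set()   # stand-in for a persistent dedupe store

def handle_job(dedupe_key, run):
    """Run a job handler at most once per dedupe key.

    Under at-least-once delivery the same message can arrive twice;
    the dedupe key makes the second delivery a harmless no-op."""
    if dedupe_key in processed:
        print(f"duplicate {dedupe_key}: ignored")
        return
    run()                        # the actual side effect (external call, etc.)
    processed.add(dedupe_key)    # recorded only after success, so a crash
                                 # mid-run is retried (at-least-once semantics)

# A client resubmits the same job; only the first submission executes.
handle_job("order-1001:charge", lambda: print("charging order 1001"))
handle_job("order-1001:charge", lambda: print("charging order 1001"))
```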
Developer-Friendly Features
Replay APIs empower clients to manage jobs without heavy manual involvement; a hypothetical call is sketched below.
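For illustration only, here is what a bulk replay call might look like from a client's point of view. The endpoint URL, path and field names are assumptions, not the actual contract:

```python
import requests

# Replay every failed job in a time window with one call, instead of
# engineers re-triggering jobs one by one. All fields are illustrative.
resp = requests.post(
    "https://scheduler.example.com/api/v1/jobs/replay",   # hypothetical endpoint
    json={
        "from": "2024-06-01T00:00:00Z",   # UTC, matching the system's convention
        "to": "2024-06-01T06:00:00Z",
        "status": "FAILED",               # replay only failed jobs in the window
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())   # e.g. {"replayed": 1240}
```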
Self-Healing System Design
A periodic reconciliation job ensures no job is permanently lost: it picks up pending jobs missed due to outages and terminates non-responding or never-ending jobs (a sketch follows below). Auto-scaling workers prevent overload during peak traffic and also help control cost.
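A minimal sketch of what one reconciliation pass could look like, assuming a relational jobs table; the table schema, status values and one-hour stuck threshold are all illustrative:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(hours=1)   # assumed threshold for a "stuck" job

def reconcile(conn):
    """One reconciliation pass over an illustrative jobs table."""
    now = datetime.now(timezone.utc)
    # Jobs still PENDING past their scheduled time were missed by the
    # poller (e.g. during an outage); put them back in the queue.
    conn.execute(
        "UPDATE jobs SET status = 'QUEUED' "
        "WHERE status = 'PENDING' AND scheduled_at < ?",
        (now.isoformat(),),
    )
    # Jobs RUNNING beyond the threshold are presumed stuck; fail them so
    # alerting and retries can take over.
    conn.execute(
        "UPDATE jobs SET status = 'FAILED', error = 'terminated by recon' "
        "WHERE status = 'RUNNING' AND started_at < ?",
        ((now - STUCK_AFTER).isoformat(),),
    )
    conn.commit()

# Minimal setup so the pass can run end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, "
             "scheduled_at TEXT, started_at TEXT, error TEXT)")
reconcile(conn)
```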
Dashboard and Alerting
A simple dashboard displays the status of scheduled jobs and trends over time. Alerts notify the right people when jobs fail or stay stuck in execution; one such check is sketched below.
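One way such an alert check could work, sketched against the same illustrative jobs table as above; the failure threshold and 15-minute window are assumptions:

```python
import sqlite3

FAILURE_THRESHOLD = 10   # assumed: page someone above 10 failures per window

def check_failures(conn, notify):
    """Alert when too many jobs failed in the last 15 minutes."""
    (failures,) = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE status = 'FAILED' "
        "AND finished_at > datetime('now', '-15 minutes')"
    ).fetchone()
    if failures > FAILURE_THRESHOLD:
        notify(f"{failures} jobs failed in the last 15 minutes")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, "
             "finished_at TEXT)")
check_failures(conn, notify=print)   # prints nothing: no failures recorded
```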
Fault Tolerance & Chaos Testing
We simulate database, broker and worker failures to validate resilience (a sketch of one such test follows). A replica database supports failover scenarios, and indexed queries keep lookups fast across large datasets.
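A toy example of the chaos-testing idea: inject broker failures and assert that no job is lost. The flaky queue, retry policy and failure rate are all fabricated for illustration:

```python
import random

class FlakyQueue:
    """Wraps an in-memory queue and randomly refuses publishes,
    simulating an unreliable message broker."""
    def __init__(self, failure_rate, seed=7):
        self.delivered = []
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)   # seeded, so the test is repeatable

    def publish(self, msg):
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("broker unavailable")   # injected fault
        self.delivered.append(msg)

def publish_with_retry(queue, msg, attempts=10):
    """The property under test: transient broker failures may delay a
    job but must never lose it."""
    for _ in range(attempts):
        try:
            queue.publish(msg)
            return True
        except ConnectionError:
            continue   # a real dispatcher would back off between attempts
    return False       # exhausted retries -> dead letter queue territory

def test_no_job_lost_under_broker_chaos():
    queue = FlakyQueue(failure_rate=0.3)
    assert all(publish_with_retry(queue, f"job-{i}") for i in range(100))
    assert len(queue.delivered) == 100   # every job survived the faults

test_no_job_lost_under_broker_chaos()
print("chaos test passed: no jobs lost")
```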
Business Impact
- Reliability: Cybinity's design transformed the scheduler into a highly reliable system, with no more missed or lost jobs.
- Reduced operational overhead: engineers no longer restart failed jobs by hand, freeing them to spend their time on innovation.
- Improved scalability: auto-scaling worker nodes handle traffic spikes effortlessly while keeping the infrastructure spend for those spikes to a minimum.
- Real-time visibility into job states, retries and failures.
- A modular, cloud-native scheduler that is ready for future growth.
At Cybinity, we specialize in re-architecting legacy systems into modern, cloud-native platforms. Our approach blends deep technical expertise with a business-first mindset, ensuring our clients get not just a working solution but one that scales with their vision.