Scaling Payments in Black Friday 2019
Last year I wrote a blog post (https://iyzico.engineering/scaling-payments-in-black-friday-2018-55fc44cd8a47) about our Black Friday scalability experience in the Payments domain, and it received many positive reactions. So I decided to write about this year's experience at iyzico as well, because I think engineers should write much more about their production experience.
First Challenge: API Request Count Prediction
First, we needed to predict the hourly request count for Black Friday 2019 and the maximum transactions per second (TPS) of our APIs, because we wanted to optimize throughput by finding and fixing bottlenecks in our APIs and by allocating infrastructure resources accordingly. Our data scientist Ömer Burak IŞIK and the data team made the predictions using a Long Short-Term Memory (LSTM) architecture. Unlike standard feedforward neural networks, LSTMs have feedback connections. For more detail: https://pathmind.com/wiki/lstm
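Before an LSTM can forecast anything, the historical hourly counts have to be framed as supervised (window, next value) pairs. A minimal sketch of that windowing step, with purely hypothetical request counts (the model itself and the real data are not shown here):

```python
def make_windows(series, lookback):
    """Turn an hourly request-count series into supervised
    (input window, next value) pairs for a sequence model such as an LSTM."""
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])  # the last `lookback` hours
        y.append(series[i + lookback])    # the hour to predict
    return X, y

# Hypothetical hourly API request counts.
hourly = [120, 135, 150, 170, 260, 410, 390, 310]
X, y = make_windows(hourly, lookback=3)
print(X[0], y[0])  # [120, 135, 150] 170
```

Each pair feeds the model three hours of history and asks it to predict the fourth; the same framing scales to the full multi-year hourly series.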

In the historical data graph, the red and green dots represent special discount days like Black Friday, and the blue dots represent ordinary days.

Second Challenge: Finding and Fixing Bottlenecks
After the prediction, we knew the expected maximum requests per second for each API and the time frames of the related peaks. The engineering team simulated the same request counts in production via load testing with JMeter (http://jmeter.apache.org/) and found the bottlenecks. The major improvements we implemented:
- Feature Toggling: Prioritized our APIs and features and added feature-toggling support for most of them, so that we can easily shed load in case of emergency.
- Health Checks: Improved the health-check algorithms of our APIs by eliminating false positives.
- Splitting Monoliths: Split some of the monolithic databases.
- Blocking Commands: Removed all blocking code and commands (like KEYS in Redis) (https://www.compose.com/articles/mastering-redis-high-availability-and-blocking-connections/)
- Monitoring Tools: Improved existing monitoring tools and implemented new ones, including technical and business alerts.
- Circuit Breakers: Tested all circuit breakers between microservices that communicate synchronously.
- Failover Tests: Tested all failover scenarios, such as latency and downtime on databases, cache servers, network tools, etc.
- Application Performance Monitoring (APM): Configured our APM tools to find the bottlenecks under heavy load and to make plans for 2020.
- Rate Limiters: Configured our rate limiters to decrease false-positive ratios.
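To make the feature-toggling item concrete, here is a minimal sketch of a toggle registry. The class and feature names are illustrative, not iyzico's actual implementation:

```python
class FeatureToggles:
    """Tiny in-memory feature-toggle registry (illustrative only)."""

    def __init__(self):
        self._flags = {}

    def set(self, feature, enabled):
        self._flags[feature] = enabled

    def is_enabled(self, feature, default=False):
        # Unknown features fall back to a safe default.
        return self._flags.get(feature, default)


toggles = FeatureToggles()
toggles.set("installment-options", True)

if toggles.is_enabled("installment-options"):
    pass  # serve the full feature
else:
    pass  # degrade gracefully under load
```

In an emergency, flipping a single flag lets operators drop a low-priority feature without redeploying anything; production systems usually back the flags with a shared store rather than process memory.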
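The blocking-commands item is worth unpacking: Redis KEYS blocks the single-threaded server while it walks the entire keyspace, whereas SCAN returns a cursor and a small batch per call. A rough pure-Python illustration of the cursor pattern (simplified; real Redis SCAN gives no ordering guarantees and uses opaque cursors):

```python
def scan(keys, cursor=0, count=2):
    """Return (next_cursor, batch): a small slice of the keyspace per call,
    mimicking Redis SCAN's incremental iteration."""
    batch = keys[cursor:cursor + count]
    next_cursor = cursor + count
    if next_cursor >= len(keys):
        next_cursor = 0  # Redis signals completion with cursor 0
    return next_cursor, batch


keyspace = ["pay:1", "pay:2", "pay:3", "pay:4", "pay:5"]
cursor, seen = 0, []
while True:
    cursor, batch = scan(keyspace, cursor)
    seen.extend(batch)  # each call does a bounded amount of work
    if cursor == 0:
        break
print(seen)  # all five keys, collected in small non-blocking steps
```

Because every call touches only a bounded slice, other clients keep getting served between calls; that is exactly why SCAN (and SSCAN/HSCAN/ZSCAN) is preferred over KEYS in production.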
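The circuit breakers mentioned above follow a well-known pattern (see the Fowler reference below): after enough consecutive failures the breaker "opens" and fails fast instead of hammering a struggling downstream service. A toy sketch, not iyzico's actual implementation:

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive errors,
    rejects calls while open, and allows a trial call after `reset_timeout`."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Failing fast keeps threads and connections from piling up behind a slow dependency, which matters most exactly at peak traffic.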
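Finally, one common way to build the rate limiters mentioned above is a token bucket: it admits short bursts up to the bucket capacity while enforcing a sustained rate. A minimal sketch under that assumption (the numbers are illustrative, not our production limits):

```python
class TokenBucket:
    """Toy token-bucket rate limiter: refills `rate` tokens per second up to
    `capacity`; a request is allowed only if a whole token is available."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)  # ~5 requests/second sustained
allowed = sum(bucket.allow(now=0.0) for _ in range(12))
print(allowed)  # 10 of 12 burst requests pass; the rest are throttled
```

Tuning `rate` and `capacity` against the load-test results is what drives the false-positive ratio down: the bucket must absorb legitimate Black Friday bursts without admitting traffic the backends cannot serve.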
Third Challenge: Planning the Day
The infrastructure and development teams updated the emergency policies, and the business teams updated the communication policies. Engineering and business teams together organized a "War Room" with plenty of dashboards. Thanks to the culture and marketing teams for the healthy food. :)

Conclusion
In the end, we nearly doubled the daily transaction count of Black Friday 2018 and doubled the maximum payment count per hour, without any latency issues or downtime. We also increased payment acceptance rates and conversion rates while managing the all-time-high traffic.

Thanks to the iyzico engineering team for scaling the technology, and to the iyzico business teams for scaling the operations and growing the business.
References:
- https://martinfowler.com/bliki/CircuitBreaker.html
- https://github.com/Netflix/Hystrix
- https://martinfowler.com/articles/feature-toggles.html
- https://pathmind.com/wiki/lstm
- https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
- http://jmeter.apache.org/
- https://www.compose.com/articles/mastering-redis-high-availability-and-blocking-connections/
- https://www.eginnovations.com/blog/what-is-application-performance-monitoring/

