Preface: This post was originally posted on Facebook on May 26th, 2021. Allegedly, the vendor that developed this project was awarded almost USD 17 million (however, the minister said it was only a fraction of that. Hmm..?)
Edit 27/5: It is not a perfect system. I drew it from a spontaneous idea in 30 minutes, out of disappointment and frustration. It might be good for an elevator pitch. That is why good teammates are important, to provide checks and balances on the architecture. It surely has missing parts or blind spots.
Edit 28/5: Apparently, on May 27th, the government decided to auto-register all existing MySejahtera users (MySejahtera is the original Covid announcement app, developed in the early phase of the pandemic).
Subject: APPOINTMENT BOOKINGS FOR ASTRAZENECA VACCINE SLOTS (26 MAY 2021)
#CucukMyAZ
To the engineers of vaksincovid.gov.my…
This came to my attention when I was unable to register for the AstraZeneca vaccine programme for those aged 18 and above on May 26th.
I’m an engineer at one of the top multi-award-winning online news portals in Malaysia, which serves millions of pageviews per day and millions of simultaneous real-time website visitors during both GE and by-election tally hours.
On May 26th, around 12pm, I tried to register myself for the vaccine, but got stuck at the PPV selection when clicking on the state button.
Without thinking further, I opened up the Inspect function in Chrome, a little puzzled: “CORS issue on the API server?” I jumped directly to the API endpoint (GET), and Cloudflare was blocking the origin with a rate-limiting error. I’m not sure if you are using the Free or a paid plan, but I believe that if you were using the Business plan (which my organization currently uses), it does not have strict rate limiting. However, you should know, Cloudflare is a b**** sometimes. (I have been using CF for so many years, and know its characteristics, its limitations and how the traffic is routed.)
For almost an hour I kept refreshing, submitting, waiting and repeating, hundreds of times, until you redirected the form back to the homepage (to show that the vaccination registration was closed).
Disappointed. I’m unable to register.
How on earth did you not anticipate this issue?
Well, what I’m about to share are some tips to ensure it does not happen again and disappoint me and millions of visitors:
- On critical endpoints, please DO NOT use a proxy service like Cloudflare, CloudFront, Akamai, Project Shield, etc. Call the API directly at your origin endpoint. Please note, I didn’t say “single server endpoint”; it should be a “cluster endpoint”. For example, if you are using AWS (which I’ve been using for so many years), the origin endpoint will be an Elastic Load Balancer. Behind the ELB, you should have a few actual instances/servers running (NOT a single server). Ensure your application is stateless.
For static information websites & the front-end, yes, please use a CDN. Must use a CDN. (I think you are already smart enough to have figured this out.)
- DO NOT use a live database during the critical period. Regardless of whether it is MySQL, MongoDB, SAP, or whatever else, do not use it to store the data during the critical hour. Even though a database can be optimized to be fast with caching and everything, the throughput is bad, especially with the database driver and translation process.
- Test, test & test. Before going live in production, do load testing on your API endpoint. Do some rough calculations, estimate the number of visitors, and triple it. There are tons of testing tools, online and offline (e.g. ApacheBench), which can show how your API handles requests. If you do the load test, you will see that Cloudflare will be your bottleneck.
- DO NOT be overconfident and overpromise to the client/customer who is paying you (in this case, the Malaysian Government). Tell the truth about what to expect, and about the mitigation plan if things go south.
- DO NOT under-provision your infrastructure. I believe the budget for the system is sufficient to have a proper architecture, and you know how important this system is to millions of people. Do over-provision (please do not sweet-talk the client/customer that you can do it cheap).
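To make the "estimate, then triple" tip concrete, here is a rough sizing sketch in Python. Every number in it (one million visitors in an hour, 5 requests per visitor, 500 requests per second per instance) is a made-up assumption for illustration, not a figure from the actual system, and the function names are hypothetical.

```python
import math


def target_rps(expected_visitors, window_seconds, requests_per_visitor, safety_factor=3):
    """Average requests per second to aim for in the load test.

    The safety factor of 3 is the "triple the number" rule of thumb.
    """
    total_requests = expected_visitors * requests_per_visitor * safety_factor
    return total_requests / window_seconds


def instances_needed(rps, rps_per_instance):
    """Instances required behind the load balancer, rounded up."""
    return math.ceil(rps / rps_per_instance)


# Hypothetical inputs: 1,000,000 visitors within one hour, 5 API calls each.
rps = target_rps(expected_visitors=1_000_000, window_seconds=3600, requests_per_visitor=5)
print(round(rps))                                   # 4167
print(instances_needed(rps, rps_per_instance=500))  # 9
```

This is only an average over the hour. Real traffic spikes in the first few minutes, so the load test should push well past this number.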
Well, I believe this is the 2nd time you guys have flopped, after the first registration for the AZ vaccine for 60-year-olds a few weeks back.
This is my idea of what you should have done.
(see image attachment)
After the website visitor fills in the form, send the data to 2 API services.
- API A : save the actual data
- API B : for tally calculation on each PPV
API A
The data sent to API A will be served by an ELB. Behind the ELB there will be more than 8 instances (monitor this ELB and the traffic for all instances; if the traffic is high, promptly increase the instances to 12 units, and so on). Why do we need multiple instances? Because each Linux process has open-file and process limits. Each instance will have a basic web server (in this case, nginx). The web server will log the incoming traffic with the POST data (idtype, nric, msid, phone). So the log file will have the actual data, which can be processed later.
The plan is to capture the data first and do the post-processing afterwards.
In post-processing, you can parse the web server log (with the POST data) and dump it into a structured database / Excel / Google Sheet. Eliminate duplicates, and filter and clean up the data in this process.
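A minimal sketch of that post-processing step in Python. It assumes a custom access-log format where the URL-encoded POST body is the last quoted field on each line; that layout, the sample lines and the function names are my assumptions for illustration. Only the field names (idtype, nric, msid, phone) come from the form.

```python
import re
from urllib.parse import parse_qsl

# Assumed log layout: the URL-encoded POST body is the last quoted field.
BODY_RE = re.compile(r'"([^"]*)"\s*$')
FIELDS = ("idtype", "nric", "msid", "phone")


def parse_line(line):
    """Return the form fields from one log line, or None if it is unusable."""
    match = BODY_RE.search(line)
    if not match:
        return None
    data = dict(parse_qsl(match.group(1)))
    if not all(data.get(field) for field in FIELDS):
        return None  # incomplete submission, filtered out
    return {field: data[field].strip() for field in FIELDS}


def unique_registrations(lines):
    """Keep the first valid record per (idtype, nric) and drop the repeats."""
    seen = set()
    records = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        key = (record["idtype"], record["nric"])
        if key not in seen:
            seen.add(key)
            records.append(record)
    return records


# Hypothetical sample lines: one valid, one duplicate, one with an empty body.
sample = [
    '203.0.113.7 [26/May/2021:12:00:01 +0800] "POST /register HTTP/1.1" 200 "idtype=nric&nric=900101015555&msid=abc123&phone=0123456789"',
    '203.0.113.7 [26/May/2021:12:00:05 +0800] "POST /register HTTP/1.1" 200 "idtype=nric&nric=900101015555&msid=abc123&phone=0123456789"',
    '198.51.100.4 [26/May/2021:12:00:06 +0800] "POST /register HTTP/1.1" 200 "-"',
]
records = unique_registrations(sample)
print(len(records))  # 1
```

From here the records can go to a CSV file or a database at whatever pace is comfortable, because nobody is waiting on a response anymore.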
API B
This API tallies how many registrations have been done at each PPV, so you can return the data to the front-end to display the availability of each PPV on a specific date. As you can see, there are a few components for the tally calculation, including a Redis service at the back to hold the data temporarily. For Redis, use the INCR command to do “i++” on the value. Again, DO NOT USE a structured database at this point.
The figures / availability will be checked by another instance, which writes the JSON output to an S3 bucket to be served back to the front-end.
Overall, your API will have robust stability and a response time of under 20ms.
Last but not least, have good teammates. A good team is important to execute a project. You cannot do it alone; checks and balances are SUPER IMPORTANT.
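A small sketch of the “i++” tally in Python. To keep it runnable without a Redis server, it uses an in-memory stand-in that exposes the same incr() call as the redis-py client; the key naming scheme, the PPV ids and the function names are hypothetical.

```python
class InMemoryCounter:
    """Stand-in for a Redis client, for illustration only.

    It mimics incr(name, amount=1) from redis-py. Unlike the real Redis
    INCR, it is not atomic across processes or servers.
    """

    def __init__(self):
        self._values = {}

    def incr(self, name, amount=1):
        self._values[name] = self._values.get(name, 0) + amount
        return self._values[name]


def tally_key(ppv_id, date):
    """Hypothetical key scheme: one counter per PPV per date."""
    return f"tally:{ppv_id}:{date}"


def record_registration(client, ppv_id, date):
    """Do the "i++" for one registration and return the new count."""
    return client.incr(tally_key(ppv_id, date))


client = InMemoryCounter()
record_registration(client, "ppv-001", "2021-06-07")
record_registration(client, "ppv-002", "2021-06-07")
count = record_registration(client, "ppv-001", "2021-06-07")
print(count)  # 2
```

With a real Redis behind it, the same record_registration() should work by passing a redis-py client instead of the stand-in, and INCR keeps the count correct even when many instances are incrementing at once.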
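Here is a sketch of what that separate instance could produce, in Python. The JSON shape, the PPV ids and the capacity numbers are assumptions for illustration, and uploading the resulting string to the S3 bucket is left out.

```python
import json


def build_availability(capacity, booked):
    """Build the JSON snapshot that the front-end reads.

    capacity: slots offered per PPV (hypothetical numbers below)
    booked:   current tally per PPV, as read from the tally store
    """
    ppv_list = []
    for ppv_id in sorted(capacity):
        remaining = max(capacity[ppv_id] - booked.get(ppv_id, 0), 0)
        ppv_list.append(
            {"ppv": ppv_id, "remaining": remaining, "available": remaining > 0}
        )
    return json.dumps({"ppv_list": ppv_list}, separators=(",", ":"))


snapshot = build_availability(
    capacity={"ppv-001": 500, "ppv-002": 300},
    booked={"ppv-001": 120, "ppv-002": 300},
)
print(snapshot)
```

Since the front-end only reads a static JSON file, the availability check never touches the API servers that are busy capturing registrations.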
Well, enough said.
(mic drop)