Rethinking an Architectural Design
At a previous (unnamed) company, I had the privilege of working on a core search project. The goal of this project was to simplify the architecture and reduce the latency of updating the search index in response to user actions. In retrospect, this project could have gone a lot better if we had made use of tools and patterns that I learned after leaving the company. Still, I think the team did the best with what we had available to us.
The Domain
This search index was made up of Items that needed to be searchable by several attributes, such as name and whether a user had selected them. There were many more searchable attributes on an Item, but they are not relevant to this discussion. Each Item could be a member of zero or more Collections. Collections could also be selected by a user, but selecting a Collection did not propagate the selection down to its Items. Collections had to be searchable by name as well. A Collection was required to have at least one Item but had no upper limit on the number of Items it contained. Not having a limit led to several ongoing issues, which I will get into later. Users could also deselect individual Items within a Collection they had selected.
Architecture of the Solution
As you can see in the diagram above, there already existed a Monolithic Database with many services connected to it. We then used Debezium to read the write-ahead log (WAL) and publish a message for each table change on a subset of tables, with Amazon SQS as the queuing service. By table changes, I mean `INSERT`, `UPDATE`, and `DELETE`.
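For context, pointing Debezium at a subset of tables comes down to a connector configuration along these lines. This is a minimal sketch expressed as a Python dict, assuming the monolith was Postgres (the WAL mention suggests it); the property names are real Debezium options, but the hostnames, credentials, and table names here are made up, and in our setup the resulting messages were relayed on to SQS rather than read straight off Kafka.

```python
# Hedged sketch of a Debezium Postgres connector configuration.
# Every value (hosts, credentials, table list) is illustrative.
debezium_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",            # logical decoding plugin for the WAL
    "database.hostname": "monolith-db.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "monolith",
    "topic.prefix": "monolith",
    # Capture only the tables the indexer cares about; every INSERT,
    # UPDATE, and DELETE on them becomes a change message.
    "table.include.list": "public.items,public.collections,public.selections",
}
```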
The indexer would then pull messages from the queue and update the Search Index accordingly. A document in the search index looked something like this:
```json
{
  "id": "123",
  "name": "Item 1",
  "collections": [
    {
      "name": "c1",
      "id": "345",
      "selections": ["444", "777"],
      "deselections": ["99"]
    },
    {
      "name": "c2",
      "id": "678",
      "selections": ["444", "777"]
    }
  ],
  "selections": ["444", "777"]
}
```
By far, the most frequently updated parts of the index were the selections of Collections and Items. Because SQS does not guarantee ordering, the Indexer would issue a select query to check whether the upsert it was about to make was the latest change; if it was not, it would pull the data for the entire document and overwrite the document in the index. There was also a Search service that handled search requests made to the index.
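That staleness check looked roughly like the following. This is a toy reconstruction, not the actual code: sqlite3 and a plain dict stand in for the real database and search index, and the per-row version column is my assumption about how "latest change" was determined.

```python
import json
import sqlite3

# Toy stand-ins: sqlite3 as the source of truth, a dict as the index.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id TEXT PRIMARY KEY, version INTEGER, doc TEXT)")
db.execute("INSERT INTO items VALUES ('123', 2, '{\"id\": \"123\", \"name\": \"Item 1\"}')")
index = {}

def handle_message(message):
    """Apply one change message, guarding against out-of-order delivery."""
    change = json.loads(message)
    row = db.execute(
        "SELECT version, doc FROM items WHERE id = ?", (change["id"],)
    ).fetchone()

    if row is None or change["version"] >= row[0]:
        # Message reflects the latest state: apply the upsert as-is.
        index.setdefault(change["id"], {}).update(change["fields"])
    else:
        # Stale message: rebuild the whole document from the source
        # of truth and overwrite whatever the index currently holds.
        index[change["id"]] = json.loads(row[1])

# A message for version 1 arrives after version 2 was committed:
handle_message(json.dumps({"id": "123", "version": 1, "fields": {"name": "old"}}))
assert index["123"]["name"] == "Item 1"
```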
Drawbacks
Massive Collections
Because there was no limit on the size of a Collection, some Collections had upwards of 1 million Items in them, and these Collections were selected by a large number of users. There were many late-night alerts caused by this, but Product would not relax the constraint. The core issue was that for each selection change on a Collection, every document in the Collection had to be updated in the search index, as the sketch below illustrates.
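Here is a toy illustration of that fan-out, using a list of dicts as a stand-in for the index (the document shape follows the JSON example above; the function itself is hypothetical, not our code):

```python
def select_collection(index_docs, collection_id, user_id):
    """Record a user's selection of a Collection against an in-memory
    stand-in for the search index. Every member document is rewritten."""
    touched = 0
    for doc in index_docs:
        for coll in doc.get("collections", []):
            if coll["id"] == collection_id:
                coll.setdefault("selections", []).append(user_id)
                touched += 1
    return touched

# One selection change touches every Item in the Collection --
# with 1M Items, that is ~1M document writes in the real index.
docs = [
    {"id": "123", "collections": [{"id": "345", "selections": []}]},
    {"id": "124", "collections": [{"id": "345", "selections": []}]},
]
assert select_collection(docs, "345", "444") == 2
```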
Maintaining User Selections in the Search Index
The fact that user selections were maintained in the search index increased index updates by orders of magnitude. The volume was high enough that the search engine rarely had time to clean up stale document segments, which hurt both query performance and storage volume.
Monolithic Database
This database had a very large number of services connecting to it. Tables could be updated and read by multiple services, which made debugging and incident investigation difficult. Only a handful of DBAs could help us find the culprit service, query, or internal cron job; it was effectively a black box. Most queries required database functions to be created in a monorepo, and most releases required a change to the database (new functions, new indexes, etc.). This led to release trains at off hours and build times of over an hour, all while the company was pushing for CI/CD.
Out of order messages
Because messages could arrive out of order, every update to the captured tables triggered at least one extra read against the source of truth. Each individual read was cheap, but together they drastically increased the volume of requests made over the network, and more network usage means more cost.
Alternative Solution
My alternative solution would be to break out User Selection of Collections and Items into its own service with its own database. Items and Collections would be managed by an Item-Collection service.
All changes in both services would result in a message being written to an Outbox Table. This table would then be read by Debezium and published to Kafka.
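The outbox pattern is what makes this reliable: the state change and the message describing it are committed in the same transaction, so the event stream can never drift from the table state. A minimal sketch, with sqlite3 standing in for the service database and made-up table and column names:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items  (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE outbox (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        topic   TEXT NOT NULL,
        payload TEXT NOT NULL
    );
""")

def create_item(item_id, name):
    # The Item row and its outbox message commit (or roll back)
    # together, so the event stream cannot drift from table state.
    with conn:
        conn.execute("INSERT INTO items VALUES (?, ?)", (item_id, name))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("item-changes", json.dumps({"op": "create", "id": item_id, "name": name})),
        )

create_item("123", "Item 1")
# Debezium tails the outbox table and publishes each row to Kafka;
# the service never talks to Kafka directly.
```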
The CUD (Create, Update, Delete) messages from the Item-Collection service would look very similar to what is published to the Search Index. Other Kafka topics could be created to reformat these messages; for example, we may only want Item or Collection data in a given topic.
User selections would be pulled by the Search service and added to the query submitted to the Search Index.
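In other words, selections move from index time to query time: the Search service resolves the caller's selections first and attaches them to the query as a filter. A sketch, assuming an Elasticsearch-style query dialect (the field names and the selection-service call are assumptions, not the real schema):

```python
def build_search_query(text, selected_ids):
    """Build a bool query: full-text match on name, optionally
    filtered down to the caller's selected Items."""
    query = {"bool": {"must": [{"match": {"name": text}}]}}
    if selected_ids:
        query["bool"]["filter"] = [{"terms": {"id": selected_ids}}]
    return {"query": query}

# Selections come from the new selection service at request time,
# instead of living inside every document in the index.
selected = ["123", "678"]  # e.g. selection_service.get_selected_ids(user_id)
print(build_search_query("item", selected))
```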
If possible, I would limit the number of Items in a Collection to 10k, or some other number chosen after measuring how long it takes for selections to be updated.
Advantages
Reduced Index Size and Number of Updates
Now that user selections are not maintained in the Search Index, it no longer has to “listen” for user-related updates, and the index itself shrinks accordingly.
No more Monolithic Database
- A single service is allowed to access the database, making it easy to attribute changes
- No more release trains
- Far fewer Database migrations
- Migrations can be managed by the service
- DBAs are woken up less often thanks to distributed responsibility and improved debuggability
In-Order Messages
Because Kafka preserves ordering within a partition, the indexer no longer has to constantly hit the source of truth for the most up-to-date document. This also reduces the time it takes for the Search Index to be updated.
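The ordering comes from partitioning: if every message for a given Item is keyed by the Item's id, all of its updates land on the same partition and arrive in order (Debezium keys change events by the table's primary key by default). A sketch of the idea using the kafka-python client; the client choice, broker address, and topic name are all assumptions:

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)

# Keying by Item id pins all updates for one document to one
# partition, so the indexer consumes them in commit order.
producer.send("item-changes", key="123", value={"op": "update", "id": "123"})
producer.flush()
```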
Reduced Resource Utilization
We’d likely be able to scale down the Search Index (reduce machine size and count), but we couldn’t confirm that without performance tests. There would be an increase in the number of databases, but each could be far smaller and could be scaled up and down (manually, in our case) to meet its load.
No more Large Cascading Changes
Since the Search Index no longer cares about User changes and we’ve limited the size of Collections, there will be no more massive cascading changes slowing down document updates. All changes are bounded.