It is not uncommon to hear about companies that have one or more outdated applications in their software stack. In an ideal world this would not be the case: everyone wants efficient, up-to-date software in their portfolio. This kind of application is usually called legacy.
What is a legacy application?
A legacy application or system is “An information system that may be based on outdated technologies, but is critical to day-to-day operations. Replacing legacy applications and systems with systems based on new and different technologies is one of the information systems (IS) professional’s most significant challenges. As enterprises upgrade or change their technologies, they must ensure compatibility with old systems and data formats that are still in use” (gartner.com).
Usually a legacy application just works. Most of the time you don’t know why (hopefully someone in the company does), but it works.
Having legacy applications is a risk, but companies decide to take that risk because changing that piece of software requires too much effort.
A project I recently decided to pick up at work was about modernising one of our legacy applications. More specifically, it was about updating the technology stack so that the application scales better and can be used in more use cases across the company. By scale I mainly mean two things: (1) being able to ingest more data and (2) being able to ingest data with different characteristics from the data it was originally designed for.
For confidentiality reasons I don’t want to disclose too many details about what the application’s purpose was and how I changed it to make it work at scale. Still, I want the reader to grasp the most important bits of the situation and the solution, so I will keep my explanation general. I hope this works.
The original application could ingest and process large amounts of data very efficiently. This holds as long as the space the data comes from is small enough (I am being vague on purpose: space here is general and can refer to multiple things, such as temporal space, geographical space, etc.). The bigger the space, the more inefficient the process is. Moreover, there is a limit on the size of the space, above which the application simply does not work.
Our use case, instead, came with the main requirement (or constraint) that the input space is virtually infinite or, in other words, orders of magnitude bigger than the original input space (100 to 10,000 times). This is what I had to overcome.
How to scale a legacy application
I explored different possible solutions, from asking the people knowledgeable about the legacy application to modify it for us, to modifying it myself, etc. In the end, considering the limited amount of resources available, the lack of time, and the lack of knowledge, I decided to take another path: create a wrapper application around the legacy one to work around its limitations.
I started thinking about the problem and had an intuition: since the main limitation of the legacy application for our use case is the input space, and it works very efficiently on small input spaces, why not look into one of the simplest paradigms in computer science and algorithm design? I ended up basing my solution on divide-and-conquer.
“A divide-and-conquer algorithm recursively breaks down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem” — wikipedia.org
Divide-and-conquer consists of three main steps:
- Divide the problem into several smaller subproblems. Normally, the subproblems are similar to the original one.
- Conquer the subproblems by solving them recursively.
- Combine the solutions to the subproblems to get the solution to the original problem.
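To make the three steps concrete, here is a minimal sketch in Java on a toy problem (summing an array). It is only an illustration of the paradigm, not code from the project; the class name, the method name and the THRESHOLD value are all made up for this example.

```java
final class DivideAndConquerSum {
    // Hypothetical cut-off: at or below this size a subproblem is "simple enough"
    // to be solved directly.
    static final int THRESHOLD = 4;

    // Sums values[from, to) using divide-and-conquer.
    static long sum(int[] values, int from, int to) {
        if (to - from <= THRESHOLD) {
            // Base case: solve the small subproblem directly.
            long total = 0;
            for (int i = from; i < to; i++) {
                total += values[i];
            }
            return total;
        }
        // Divide: split the range into two halves.
        int mid = from + (to - from) / 2;
        // Conquer: solve each half recursively.
        long left = sum(values, from, mid);
        long right = sum(values, mid, to);
        // Combine: merge the two partial solutions.
        return left + right;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sum(values, 0, values.length)); // prints 55
    }
}
```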
The use case I was faced with was ideal for divide-and-conquer. I have a big input space; I split it into smaller spaces, each treated as a subproblem that I know the legacy application handles very efficiently; and at the end I combine the results of all the subproblems in a smart way, so that the combined result is the same as the solution to the initial problem.
So in the end, I wrote an application in Java that wraps the legacy application (which is pulled in as a Maven dependency). This wrapper needs to (1) split the input space in a smart way (divide), (2) execute the legacy application on each subproblem and collect the results (conquer), and finally (3) combine the results and serve them as a single result to the caller. That easy.
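Since I cannot share the real code, here is a rough sketch of what such a wrapper could look like. Everything in it is an assumption made for illustration: the input space is modelled as a simple numeric interval, the legacy application is replaced by a stand-in function that refuses slices wider than 100, and the names (LegacyWrapper, Range, divide, run) are hypothetical. The real splitting and merging logic was more involved than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class LegacyWrapper {
    // Hypothetical model of the input space: a half-open interval [start, end).
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Divide: cut the big space into slices no wider than maxSpan.
    static List<Range> divide(Range space, long maxSpan) {
        if (maxSpan <= 0) {
            throw new IllegalArgumentException("maxSpan must be positive");
        }
        List<Range> parts = new ArrayList<>();
        long lo = space.start;
        while (lo < space.end) {
            long hi = Math.min(space.end, lo + maxSpan);
            parts.add(new Range(lo, hi));
            lo = hi;
        }
        return parts;
    }

    // Conquer and combine: run the (black box) legacy call on every slice
    // and fold the partial results into a single one.
    static <R> R run(Range space, long maxSpan, Function<Range, R> legacy,
                     R identity, BinaryOperator<R> combine) {
        R result = identity;
        for (Range part : divide(space, maxSpan)) {
            result = combine.apply(result, legacy.apply(part));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the legacy application (an assumption, not the real one):
        // it counts the points in a slice and fails when the slice is too wide.
        Function<Range, Long> legacy = r -> {
            if (r.end - r.start > 100) {
                throw new IllegalStateException("input space too large");
            }
            return r.end - r.start;
        };
        long total = run(new Range(0, 1_000_000), 100, legacy, 0L, Long::sum);
        System.out.println(total); // prints 1000000
    }
}
```

The caller only sees run(); the fact that the legacy code was invoked 10,000 times on small slices stays hidden behind the wrapper.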
Technical solution — Kubernetes
The solution explained above sounds like something we hear frequently in computer science.
Modern frameworks are based on it, from Hadoop and its MapReduce (which is actually more than divide-and-conquer) to Apache Spark, etc. But remember that here we are dealing with a legacy application that we cannot change, so we cannot rewrite it to work on Spark.
Solution 1 — Modern technology
We could create the wrapper in Java to run on Apache Spark. This wrapper splits the big initial space by applying transformations and creating N partitions, one for each subproblem. Then, with .mapPartitions() you can execute the legacy app in Spark on each partition. Finally, you collect the results and apply transformations again to merge them. I wanted to give this a try, but on second thought I decided to discard this solution. In the end we don’t have control over the legacy application; we see it as a black box. In my case, a memory-hungry black box. I didn’t want to end up with an application running inside Apache Spark that gives me a lot of memory-related problems.
Solution 2 — Kubernetes
What I found to be common practice in these scenarios is to use Kubernetes instead.
“Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.” — kubernetes.io
You wrap the legacy app in a Docker image, deploy it as a Deployment in Kubernetes and enable autoscaling. Autoscaling makes it possible to run multiple instances of the same application (conquer), and the scaling logic can be based on resource usage such as CPU and RAM or on other metrics, such as incoming requests. Finally, you put a load balancer in front of it to handle increasing traffic and to redirect it to the available pods.
The wrapper also includes two additional components: a pre-processor (divide), which is in charge of splitting the problem into subproblems, and a post-processor (combine), which merges the individual results of the subtasks.
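The flow between these components can be sketched in plain Java. This is only an approximation under loud assumptions: a fixed thread pool stands in for the pool of pods behind the load balancer, a simple function stands in for a request to one instance of the legacy app, and the names (FanOutSketch, conquer) are hypothetical. In the real setup the subproblems are sent over the network and Kubernetes decides how many instances are running.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class FanOutSketch {
    // Sends every subproblem to one of `instances` workers and returns the
    // partial results in the same order as the subproblems.
    static <T, R> List<R> conquer(List<T> subproblems, Function<T, R> worker, int instances) {
        // The thread pool plays the role of the autoscaled pods (assumption).
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (T sub : subproblems) {
                Callable<R> task = () -> worker.apply(sub);
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                try {
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a subtask", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("a subtask failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Pre-processor output (divide): here just four toy subproblems.
        List<Integer> subproblems = Arrays.asList(1, 2, 3, 4);
        // Conquer: each subproblem is handled by whichever worker is free.
        List<Integer> partial = conquer(subproblems, x -> x * x, 2);
        // Post-processor (combine): merge the partial results.
        int combined = 0;
        for (int p : partial) {
            combined += p;
        }
        System.out.println(combined); // prints 30
    }
}
```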
My two cents learned from this experience:
If you have a similar problem, but you can write your application from scratch, go for modern technologies such as Apache Spark. If you cannot, use Kubernetes instead.
This is probably not the optimal solution in the long run. With an entire team at our disposal and the appropriate resources, we could probably have rewritten the legacy application and achieved better results.
But considering all the constraints, this turned out to be an impressive result in the short term, or as a temporary solution. In fact, this work enabled multiple new use cases, brought new stakeholders on board, and optimised operational costs (cloud infrastructure).
And now that the solution is in use and gaining momentum, with more and more users, it could even happen that new resources are assigned to this project so that we can get rid of the legacy application entirely, and in the end what we did turns out to be the optimal solution even in the long run.