We developers love to talk and moan about many aspects of our job. And we do it a lot. Any given day you’ll find plenty of posts, on HN or elsewhere, discussing the next big thing, the new framework that will heal the sick world of web development, ranting about hiring managers and ill-conceived management processes. And there’s nothing wrong with it: our job is awesome in many different aspects, but sometimes it’s not so great. I get it.
You won’t find a lot of information and opinion pieces about one fundamental part of our day to day life, though: how to deal with refactoring. Sure, we have a ton of classic books about the subject, and the most vocal proponents of TDD and XP will talk at great lengths about the overall importance of striving for clean code and how important it is to not forget the “refactoring” part of the canonical TDD cycle. I personally didn’t find as much information as I wanted, and what I’m really curious about is reading other people’s experiences: how do you deal with legacy code? What do you do when confronted with some large module you don’t fully understand? What do you look for? Where do you start from? I’d like to know more.
I’ll try to explain what I do in those cases. Be warned: it might not be the best way, it’s probably naive and inexperienced, and I’m sure there is a lot of room for improvement (I’m not a particularly gifted developer, nor do I consider myself to be intelligent, and that’s why I have to do anything I can in order to keep things manageable, straightforward and accessible when I touch them or write them from scratch). Nevertheless, it’s my attempt at describing my overall thought process. I’d love to get some feedback about it, so please be cruel and smash my method into pieces, if you think you can do better. But please, keep in mind that English is not my first language, and sometimes it may show in weird syntax or wrong orthography.
First things first: when you scroll through a module or class for the first time, your goal is to understand its purpose: code is (usually) there for a reason, nobody (well, almost) writes code if he can avoid doing so (we’re lazy bastards), so don’t rush to dismiss other peoples’ code as meaningless or useless. Use your tool of choice to look for occurrences and usages, look at the big picture but don’t be afraid of the minutiae. At the same time, don’t keep your sight on only one layer of abstraction: one of the bad things about legacy code is often the mixing up between different concerns and abstractions. Be ready to go up and down, left and right when you read code for the first time.
Tests: don’t even think about doing a major refactor without having a satisfactory suite of tests. Write your tests first, if not unit at least functional, to make sure you can go ahead with confidence and not break anything in the process. It really depends on your application, but usually you want to make sure that the outermost APIs work in the intended way, and keep going forward until you’ve established what the acceptable behaviour is for the largest set of inputs. If you find this too difficult, or if there is a significant amount of ambiguity about what something is supposed to do, stop here: You have a bigger fish to fry, and you should address the problem first and come back later.
Clarify intent: once you understand the purpose, you’ll find that something works in an unintuitive or obscure way. One thing I find useful is renaming variables or functions. Perhaps that function or method name was chosen a long time ago, and then someone decided to add a boolean flag to address a corner case. Don’t be afraid of using a silly long name or putting ands, ors, ifs in those names. If anything, it will help to remind you when a function does more than one thing and clearly identify areas that need work. Go around, explore, put comments wherever you want (writing code is cheap! Reading code is not!); you’re not supposed to share your comments yet, so be foolish as much as you want. You want to be able to read the code again as if it were a novel: if you read an article in plain English, isn’t it easy for you to spot obvious problems or logical inconsistencies? There is no reason your code can’t tell you the story.
Let your code breathe: this is somewhat controversial, but I really like to put a lot of space around blocks of code when first dealing with a module I want to refactor. It somehow helps me in drawing a mental map of what’s going on in the code, isolate logical units, and who knows? Sometimes those “blocks” between a couple of white lines will translate in a comfortable function call effortlessly. Move lines around, stick declarations near each other, try to group functions by their purpose: did you find more than one function that does the same thing slightly differently?
Divide: almost 99% of the time you’ll end up with functions “too large to follow and maintain”. I don’t know about you, but being not particularly intelligent myself, I have a hard time dealing with complexity. It’s much easier to build a block with several different tiny components than it is to do so with only one purposely built big component. Try to split them up into independent components, and keep doing that until you reach the ideal point where every function does exactly one thing. You can use your IDE’s tools to help in doing that: IntelliJ has a useful Refactor to method feature that tries to extract the selected text in a function, and automatically infers its input and output arguments; this allows you to spot really quickly the data structures that are taken and returned, and get a glimpse of shared state between them.
Isolate state: it’s a source of confusion. There’s nothing inherently wrong with state, a good deal of web applications are inextricably linked to variable state at some point, and it’s not always possible/effective to strive for complete immutability. What you want to do is isolate the state as much as possible, and reduce the amount of shared state and possible mutation in objects, leaving mutability in a few very obvious spots, where it’s easier to keep under control.
Write docstrings. Every language on earth will allow this with some standard / recommended way, please use them if you’re rewriting code. It may not be essential, but will provide some solid value to future developers digging through your code (and again, your IDE could be able to use them to provide context when you hover over a function/class).
Delete all the comments. Comments are, again, controversial: you should use plenty of them in your git history, and in your PR descriptions, when words are the only allowed way to communicate. But they have no place in code. 99% of the time, you think you need a comment because your code is not obvious or behaves in a bizarre way. But in code there must be no surprise. Especially if you’re using a very explicit and expressive high-level language like Python or Scala or Ruby, there should really be no need to comment. Delete all the comments, re-read your code and spot the places where it doesn’t make sense anymore. Change that to make sense. You’ll make yourself and your mates happier.
(There are exceptions to this rule: perhaps you’re doing something that seems wrong but it’s needed because an external API does the wrong thing, maybe you need to mangle the contents of a file because it’s expected to be incorrect as it is. Or maybe you’re describing a complex, long and difficult algorithm that only people from your field of expertise will recognise.)
Constantly test your changes: you might need fast tests for this, you don’t want to end up not running the tests because they slow you down (I’ve done this several times in the past, and I’m not proud of it). There is no need for a comprehensive set of tests to be slow at all, unless you’re doing things wrong or maintaining a very, very large enterprise application (and still: you shouldn’t need to run an entire test client of the application just to test if a function returns the expected values when given an input – integration and functional tests should live in different modules/folder, so that you can run them every once in a while to make you sure you haven’t introduced any regression – i.e. before committing).
Don’t optimise ahead of time: if you’re still coming to grips with the code and you’re busy understanding how all the pieces fit together, don’t be afraid to use inefficient data structures or waste loop cycles, or even write useless functions. You want to optimise at some point, yes, and this point comes always at the end of the process. Don’t be like the bazaar: you want your application to be solid, predictable and correct. Having it running faster and faster shouldn’t come at the expense of testability or correctness, ever.
Keep the story going: if you’re refactoring a piece of legacy code (and every application has some legacy code: every single piece of code eventually goes to rot – if you have some corners of the application you dread to touch, that’s legacy my friend), it usually helps people reviewing your code to be able to follow your thought process. So if you’re implementing a particular strategy or following a pattern, clarify that in your commits, even using descriptions fully: you want to end up with a series of commits that reads like a novel (again) and help people to glimpse your purpose without having to open the diff view in Github/Gitlab/whatever. Don’t be afraid to waste commits, they cost nothing to you (and you can always squash them together). Do the same thing when you’re addressing PR comments: one commit per remark/amendment, to acknowledge the action you’ve taken to address your fellow coder comment.
Never forget that code is a collective effort, and communicating is your best strategy, always. In software development, code is both the artifact and the instructions; coding is communicating. You’re speaking both to the machine, telling it what to do, and to your colleague or even future self. Don’t try to be super smart with clever one-liners that do not improve anything except your self-esteem. If there is a way to explicitly say something, say it explicitly: the machine does not value your cleverness, and every micro-second you might have shaved will cost you dear in terms of time and effort. If you think that CPU time or HTTP response time cost, think again at how much the equivalent developer time will cost to your business.
Last but not least: refactoring has nothing to do with reducing LoCs. Most of the time you’ll think that a 1300-line class/module is a mess, just because it’s long. Most of the time you’ll be right, but only because a lot of lines are a symptom of code repetition and incorrect use of patterns. If your program is in fact complex and deals with a lot of corner cases, you can expect it to grow longer as time goes by. It might even be the case that your refactoring will make the module/class grow: there’s nothing inherently bad about it, especially if your language of choice has a lot of boilerplate around basic object-oriented structures (I’m looking at you, Java). LoCs are simply not a good metric. I personally prefer a longish module that is correct and clear (long variable names, explicit iteration, vertical space) and does what it says it does, over a shortish module full of one-letter variables, obscure patterns and global variables being passed around.
Phew. That was long. I think I managed to explain a little bit of what I do and what my principles are when dealing with legacy code in dire need of a refactoring. But that’s not the point. What I want to know is: what’s your strategy? What are your principles to follow through a good refactoring job?