The project I’m involved in at work is easily the largest I’ve worked on to date, spanning almost two years and involving around 100 developers. Without going into too much detail, I can say that the purpose of the project is to build a new platform for current and next-generation telephone base stations, which, as far as I’ve gathered, are the stations located all around the country that let my mum phone me and ask why I never call.
The part that I’ve been most involved with has been the design and implementation of the upper layers of the communication stack, i.e. the interface that a fair percentage of the other developers come into contact with when they want their components to talk to other developers’ components.
I’ve worked on a few different libraries and APIs before, so I really didn’t think this would be that different, but the sheer scale of things meant that oversights that previously only required a few minutes and a quick chat to fix could now incur several weeks of planning, and slews of angry mails when new releases suddenly broke lots of previously working test cases.
After a while I noticed that the general issues mostly fell into a handful of categories, and that writing these things down while they were still fresh in my mind would do me good for future endeavors. I also realized that I haven’t blogged that much lately, and that this might make a good post!
It’s your API, so it’s your fault
We are a very diverse group of developers on our project, with varying backgrounds and varying experience. This meant that if you released something with a somewhat vague interface, there would be as many different usages as there were developers.
At first I was annoyed at this, and naturally expected my co-workers to be clairvoyant and get what I _really_ meant, but after an (embarrassingly long) while, I came to grips with the fact that it was my responsibility as the library writer to make sure that people were using my interface in the way I had actually intended.
This led to me seeing unintended usage as a game, where the goal was to get people to do what I wanted, and when they didn’t, to try to figure out why, and how I could convey my intent more clearly. This means things like consistent and clear naming, and making sure that you say const when you mean it.
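To make the const point a bit more concrete, here is a minimal sketch. The funky_message type and both functions are made up for this post and are not from the real code base. The idea is simply that the signature alone tells the caller whether their data will be touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message type, invented for this example. */
typedef struct {
    unsigned char payload[64];
    size_t length;
} funky_message;

/* Read-only access: the const pointer promises that the message
 * is not modified by this call. */
size_t funky_message_length(const funky_message *msg)
{
    assert(msg != NULL);
    return msg->length;
}

/* Mutating access: the non-const pointer signals that the message changes. */
void funky_message_clear(funky_message *msg)
{
    assert(msg != NULL);
    msg->length = 0;
}
```

A caller holding a const funky_message * can call the first function but gets a compiler diagnostic for the second, which is exactly the kind of hint I wanted the interface to give.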
Once it’s in the .h file, it’s fair game
In the book “Refactoring” it’s called a published interface, which is exactly what it is. Once you’ve written something in a header file and released it, that file is from that point on fair game for people to use in any way they see fit. This means that you need to put even more thought into everything you publish, and really consider whether the things you expose in your header file need to be exposed, or if you can add some form of indirection, or otherwise make them opaque.
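One common way of making things opaque in C is an incomplete type: the header only declares that the struct exists, and the layout stays in the .c file. The sketch below uses invented names (funky_channel and friends), and shows both halves in one listing for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* ---- what would go in the public header (hypothetical funky_channel.h) ---- */
typedef struct funky_channel funky_channel;  /* incomplete type: layout hidden */

funky_channel *funky_channel_open(int peer_id);
int funky_channel_peer(const funky_channel *ch);
void funky_channel_close(funky_channel *ch);

/* ---- what would stay in the private .c file ---- */
struct funky_channel {
    int peer_id;
    int retries;  /* can be renamed or removed without breaking clients */
};

funky_channel *funky_channel_open(int peer_id)
{
    funky_channel *ch = malloc(sizeof *ch);
    if (ch != NULL) {
        ch->peer_id = peer_id;
        ch->retries = 0;
    }
    return ch;  /* NULL if the allocation failed */
}

int funky_channel_peer(const funky_channel *ch)
{
    assert(ch != NULL);
    return ch->peer_id;
}

void funky_channel_close(funky_channel *ch)
{
    free(ch);  /* free(NULL) is a no-op, so closing NULL is harmless */
}
```

Since client code never sees the struct members, it can only go through the functions, which leaves you free to change the internals later.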
It’s no surprise that people hate breaking API changes, and as a project grows in scale, the negative impact of a change that breaks an API can actually outweigh the positive effects of that change. This was hard for me to come to terms with. I’d never worked on a project so large that a breaking API change meant man-days of work for the whole project to update, so I just assumed that I could make whatever changes I saw fit, as long as they made something better.
We tried very hard to avoid exposing any hard-coded values, e.g. #define SUCCESS 0, in the public headers, as this could easily lead to people just comparing against 0 instead of the define. Instead we supplied Win32-style macros such as IS_SUCCESS(x), which tested whether the success bit was set or not.
In one part we weren’t as careful, and supplied something like #define FUNKY_SYSTEM_ERROR 1. This worked fine for a while, but then we wanted to start adding more detailed error codes, e.g. #define FUNKY_SYSTEM_MEMORY_ERROR 2, which turned out to be a pain, as there were already a lot of comparisons against the generic FUNKY_SYSTEM_ERROR value in client code.
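Here is a rough sketch of how status codes could be laid out so that adding detailed codes later doesn’t break existing checks. The bit layout, the numeric values and the IS_SYSTEM_ERROR macro are assumptions made for this post, not what our real headers looked like. I’m also using a failure bit rather than a success bit, so that 0 can stay the success value.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t funky_status;

/* Bit layout (an assumption for this sketch):
 *   bit 31      failure flag
 *   bits 16-23  error category
 *   bits 0-15   detail code
 */
#define FUNKY_FAILURE_BIT      0x80000000u
#define FUNKY_CATEGORY_MASK    0x00FF0000u
#define FUNKY_CATEGORY_SYSTEM  0x00010000u
#define FUNKY_DETAIL_MASK      0x0000FFFFu

#define FUNKY_OK                   0x00000000u
#define FUNKY_SYSTEM_ERROR         (FUNKY_FAILURE_BIT | FUNKY_CATEGORY_SYSTEM | 0x0001u)
#define FUNKY_SYSTEM_MEMORY_ERROR  (FUNKY_FAILURE_BIT | FUNKY_CATEGORY_SYSTEM | 0x0002u)

/* Clients test with these macros instead of comparing raw values. */
#define IS_SUCCESS(s)  (((s) & FUNKY_FAILURE_BIT) == 0u)
#define IS_SYSTEM_ERROR(s) \
    (!IS_SUCCESS(s) && (((s) & FUNKY_CATEGORY_MASK) == FUNKY_CATEGORY_SYSTEM))

/* Example client check: it asks "was it a system error?", so it keeps
 * working when more detailed system error codes are added. */
int client_should_restart(funky_status status)
{
    return IS_SYSTEM_ERROR(status) ? 1 : 0;
}

/* Extracts the detail code; only failures carry one. */
unsigned funky_status_detail(funky_status status)
{
    assert(!IS_SUCCESS(status));
    return (unsigned)(status & FUNKY_DETAIL_MASK);
}
```

With something like this, a client that checked IS_SYSTEM_ERROR(x) before FUNKY_SYSTEM_MEMORY_ERROR existed would still take the right branch afterwards, whereas a client that wrote x == FUNKY_SYSTEM_ERROR would not.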
Add validation code first
We wanted to get our API out quickly so that, even if the underlying functionality wasn’t actually there, people could at least compile against it.
This went well, and after a while we got round to adding real functionality in place of the stubbed code, and also adding validation code (asserts and the like). This is when we realized that even if you only write stub functionality, you should still write the code that checks that the inputs have valid values.
People don’t have a problem at all with making sure that they give you the correct values (especially if they’re documented), but adding validation code later in the project annoys people, because it breaks something that’s already working.
Adding extra validation code late in a project that makes people’s code fail is a good idea in the ideal world, but not always that popular in the real one!
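As a sketch of what I mean, here is a stub that does nothing useful yet but already checks its inputs the way the real implementation eventually would. The function names and the payload limit are invented for this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FUNKY_MAX_PAYLOAD 1024u  /* hypothetical limit */

/* The input rules, written down once and shipped with the very first stub. */
bool funky_send_args_valid(const void *payload, size_t length)
{
    return payload != NULL && length > 0 && length <= FUNKY_MAX_PAYLOAD;
}

/* Stub: nothing is actually sent yet, but invalid arguments already
 * trigger an assert, just as they will in the real implementation. */
int funky_send(const void *payload, size_t length)
{
    assert(funky_send_args_valid(payload, length));
    (void)payload;  /* unused until the transport layer exists */
    (void)length;
    return 0;       /* pretend the send succeeded */
}
```

Clients that pass garbage find out on day one, when fixing it is cheap, instead of months later when a "working" test suddenly starts failing.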
Hard coding, or limiting freedom, isn’t always a bad thing
In our network, we had a predefined topology, but we only realized very late in the project that we could take advantage of this.
Instead we had an API based upon ids that were strings, where both parties trying to talk to each other needed to enter the same strings in several places. This naturally led to a lot of typos, and because we allowed arbitrary strings as inputs, we didn’t know if we were getting handed the correct string or not, so we had no way of validating the input. We toyed with the idea of having a global table of allowed string values, but there wasn’t any centralized list of what was actually allowed, so this would have been very hard to maintain.
Lots of errors due to typos could have been avoided if we had instead generated (for example) a header file containing enums to be used as ids, so that the range of possible inputs would have shrunk down to things that actually made sense. The “template file” used to generate the header file could also have been used to generate the parts of the documentation that described the topology.
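A generated header along those lines might look roughly like the sketch below. The node names are made up, and in practice the enum and the name table would both be produced from the same topology description rather than written by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Would be generated from the topology description (names are invented). */
typedef enum {
    FUNKY_NODE_RADIO = 0,
    FUNKY_NODE_BASEBAND,
    FUNKY_NODE_CONTROL,
    FUNKY_NODE_COUNT  /* number of nodes, not a node itself */
} funky_node_id;

/* Generated from the same source, so it cannot drift from the enum. */
static const char *const funky_node_names[FUNKY_NODE_COUNT] = {
    "radio",
    "baseband",
    "control"
};

/* With enums, validating an id is a simple range check. */
int funky_node_id_valid(int raw_id)
{
    return raw_id >= 0 && raw_id < FUNKY_NODE_COUNT;
}

/* Maps an id back to the string used on the wire or in logs. */
const char *funky_node_name(funky_node_id id)
{
    assert(funky_node_id_valid((int)id));
    return funky_node_names[id];
}
```

A typo like FUNKY_NODE_BASEBNAD now fails at compile time, instead of silently becoming a connection to a node that doesn’t exist.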
In the end we partially solved this by supplying a bunch of helper functions of the form “a_talk_to_b” that clients could use. These functions would in turn make the connection calls with string parameters that had been heavily scrutinized, and were hopefully correct.
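The helpers were roughly of the shape sketched here. funky_connect is a stand-in for the real string-based call (it only records what it was given), and the ids and the radio/baseband names are invented for this post.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a connection; it just remembers the two ids. */
typedef struct {
    const char *from;
    const char *to;
} funky_connection;

/* Stand-in for the string-based connect call: accepts any strings,
 * so it has no way of catching a typo. Returns 0 on success. */
int funky_connect(const char *from, const char *to, funky_connection *out)
{
    assert(from != NULL && to != NULL && out != NULL);
    out->from = from;
    out->to = to;
    return 0;
}

/* The id strings live in exactly one, heavily reviewed, place. */
static const char RADIO_ID[] = "radio";
static const char BASEBAND_ID[] = "baseband";

/* Helper of the "a_talk_to_b" form: clients never type the strings. */
int radio_talk_to_baseband(funky_connection *out)
{
    return funky_connect(RADIO_ID, BASEBAND_ID, out);
}
```

It doesn’t make the underlying API any safer, but it moves the typo-prone part out of every client and into one file.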
Who are you writing your error codes and messages for?
This is tricky, and we didn’t really figure out a good strategy other than “hmm, we need to think more about this”.
Is an error specifying that a pointer to some internal structure is NULL really useful for the person using your code? Probably not, as he’d rather have something more descriptive that tells him why he’s getting it, and how to avoid it. Perhaps an explanatory sentence first, and then the hardcore debug info for you to look at?
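We never settled on a format, so take the following as one possible sketch of the “explanation first, debug info last” idea rather than something we actually shipped. The struct, the function and the message layout are all invented for this post.

```c
#include <assert.h>
#include <stdio.h>

/* One error, described for two audiences. */
typedef struct {
    const char *explanation;   /* for the API user: what went wrong */
    const char *hint;          /* for the API user: how to avoid it */
    const char *debug_detail;  /* for the library maintainer */
} funky_error_info;

/* Writes "<explanation> <hint> [debug: <detail>]" into buf.
 * Returns what snprintf returns: the length the full message needs,
 * which is >= size if the message was truncated. */
int funky_format_error(const funky_error_info *info, char *buf, size_t size)
{
    assert(info != NULL && buf != NULL && size > 0);
    return snprintf(buf, size, "%s %s [debug: %s]",
                    info->explanation, info->hint, info->debug_detail);
}
```

A message built this way might start with “The channel has not been opened.” and only end with the pointer-is-NULL detail that helps whoever maintains the library.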
Throughout the project we had a wiki that started off very empty, but grew over time, as we realized that this was a great place to write documentation and guides for our system. To help with the “cryptic error messages” problem, we had a troubleshooting section that listed the actual error message that you got (e.g. “bla bla pointer is NULL”, which is what the test framework would say if an assert triggered), and then a short paragraph saying what this meant, and the common causes.
Living documentation
A wiki-based user guide turned out to be a very good idea, but it needs to reach some sort of critical mass so that people start going to it for help, and they need to be encouraged to poke you when they can’t find what they’re looking for, so that you in turn update it, and people’s confidence in the documentation can actually start to build.
At one point it almost seemed like spamming, with constant replies of “have you read the wiki?” when people came with questions that were answered there, but over time people’s faith in the documentation grew, and I couldn’t help but feel a little proud as I walked by developers sitting debugging using my documentation as a sort of reference.
After a while people are going to read your documentation, and it’s by having up-to-date documentation that people can trust your code. You are going to be in a situation where you’re debugging something strange, and the person having the error says “but the documentation says XXX”. At this point, you can’t just say “yeah, but I wrote it, so I know it’s really YYY”, so once people start reading the documentation, it’s important that you also keep it up to date.
And finally, something not really related to API design, but a fact that I learned the hard way.
Tools take time, and code generation is hard
I started out thinking that I could just whip up some advanced code generation scripts and whatnot in an afternoon (“but, but, Python’s template functions are so powerful!”), but a couple of failures made me realize that making things production quality does take time, so you’re best off telling your boss what you’re about to attempt, and what the benefits will be. Both so he knows that you’re not goofing off, and so that you can take some time to debug your stuff when it malfunctions later on.
Sometimes just talking to someone else will make you notice that you actually just want to try something because it sounds like fun, and that you’re solving a relatively simple problem that has already been solved, and proven to work, several times over!
There’s a related blog post called The mythical man weekend that’s a good read on how easy it is to underestimate the amount of work needed to get something ready for mass consumption.
Phew, I guess that’s about it for this time. Working on an API that was used by lots of developers (who are also your co-workers, and aren’t afraid to give direct feedback!) has been great fun, and has given me a lot to think about, both from a practical coding point of view and from a theoretical and design point of view. It’s something that I wouldn’t mind doing again, now that I feel I have a greater understanding of how to do it. At least I think I do :)