Hardware
Once upon a time I was a PC tech. It's really easy to identify where the boundaries are in hardware, because they have plugs and sockets. So for example, if a monitor is flickering, we start by plugging in a different cable. If that doesn't fix it, we try a different screen. If that flickers too, the chances are it's the graphics controller in the PC, so we replace that. If that doesn't solve the problem, then we have to start looking for weird reasons, like the time it was caused by electromagnetic interference from an industrial plastic moulding machine just on the other side of the wall... (this actually happened).
How do we take that "it's got a plug and socket" analogy into other kinds of problems?
Boundaries occur where different kinds of things join together.
Boundaries occur where something gets changed into something else.
Boundaries occur where it's easy to unplug it and try a different one.
Boundaries occur where multiple things join into a single thing.
Networks
I spent a good chunk of my early career working on networks. That extends the 'plug and socket' idea slightly further, because some of the parts are a long way away but can still have effects at a distance. Here's an example:
One day I went out to look at a machine that kept crashing. We had a heuristic for troubleshooting unexplained crashes on this particular type of machine: they were often caused by noisy communication cables.
When I checked the noise levels, one particular connection looked problematic - thousands of logins to each logout.
The cable led to a modem, which led to a box on the wall. By asking around I found that the box was a connection to another office, which contained a remote terminal. We had another heuristic here - a lot of the terminals the customer used were Wyse 50s, which had a known problem with crackling power switches. I called the remote office and asked them if their terminal made a crackling noise when they switched it on. “How did you know that? Yes, I daren’t touch it any more, I use a pencil to switch it on and off!”
A quick call to the local engineer to pop in and swap the power switch (a 2 minute job) and the original problem with the crashing machine went away.
Another, similar incident:
One afternoon, out of nowhere, the entire network went down. Not just one office, all three sites. The lights on the Ethernet hubs were flickering on and off much faster than normal.
First we isolated each building - literally just pulled the wire out of the hub.
Two buildings started working again.
One didn’t. Guess where our problem is then.
So we carried on isolating sections of the network. Pull out the cable that leads to the 3rd floor. Still broken. 2nd floor. Still broken. First floor, and things start working.
OK, it’s the first floor. On a hunch, how about the research lab? Yep, the problem’s in the research lab.
We pulled out the cables one by one in the research lab until the network traffic stopped. The faulty cable? It'd been unplugged and left on the desk, and was short circuiting on one of the screws holding the desk together.
What we can see in these two examples is the interplay of problem partitioning and heuristics. A hunch about the cause of the problem helps us choose what area to isolate, and isolating the area tells us whether our hunch was right.
We keep narrowing down possible causes for the problem until we have it confined in a space we can check thoroughly.
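The unplug-and-check routine from the second incident can be sketched in a few lines of Python. This is only an illustration: the topology, the segment names and the `works_without` probe are all made up for the example, and in real life the "probe" is a person pulling a cable and watching the hub lights.

```python
from typing import Callable, Dict, List

# Hypothetical topology: each segment maps to the sub-segments plugged into it.
TOPOLOGY: Dict[str, List[str]] = {
    "network": ["building A", "building B", "building C"],
    "building C": ["floor 3", "floor 2", "floor 1"],
    "floor 1": ["offices", "research lab"],
    "research lab": ["cable 1", "cable 2", "cable 3"],
}


def contains(segment: str, target: str) -> bool:
    """True if `target` is `segment` itself or sits somewhere beneath it."""
    if segment == target:
        return True
    return any(contains(child, target) for child in TOPOLOGY.get(segment, []))


def locate_fault(segment: str, works_without: Callable[[str], bool]) -> List[str]:
    """Walk down the topology, following whichever sub-segment fixes
    things when it is unplugged. Returns the path to the fault."""
    path = [segment]
    while segment in TOPOLOGY:
        for child in TOPOLOGY[segment]:
            if works_without(child):  # unplug it; does the network recover?
                path.append(child)
                segment = child
                break
        else:
            break  # no single sub-segment explains the fault; stop here
    return path


# Simulated probe: pretend the fault is "cable 2" (an assumption for the demo).
path = locate_fault("network", lambda s: contains(s, "cable 2"))
print(" -> ".join(path))
```

Each pass through the loop throws away everything except the one branch that still contains the problem, which is why the search stays short even on a big network.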
Notice also that in the flickering monitor example we replaced the parts in the following order:
Cable
Screen
Graphics controller
Why?
Well, because the cable was easy to replace. The screen itself is a little more unwieldy - it takes a little more effort to lift and move. Finally the graphics controller: the PC has to be partially dismantled to replace that. It’s the most effort to try, so we take the easier options first.
Prioritisation in troubleshooting works like this: weigh how easy each possible fix is to try against how likely it is to solve the problem. Try the easy, medium-likelihood fixes first, then the most likely ones, then the easy long shots, and finally the unlikely, hard-to-test cases.
(This is just another heuristic - it's not guaranteed to be right, but it's a reasonable starting point.)
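One crude way to approximate that ordering in code is to score each candidate fix by likelihood divided by effort and try the highest scores first. The sketch below uses the flickering monitor example; the `Candidate` class is hypothetical and the likelihood and effort numbers are guesses invented for illustration, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    likelihood: float  # guessed chance that this is the cause (0 to 1)
    effort: float      # rough cost of trying the fix, in minutes


def prioritise(candidates: List[Candidate]) -> List[Candidate]:
    """Order fixes so that cheap, plausible ones come first."""
    return sorted(candidates, key=lambda c: c.likelihood / c.effort, reverse=True)


# All figures below are made up for the example.
fixes = [
    Candidate("graphics controller", likelihood=0.35, effort=10),
    Candidate("interference from next door", likelihood=0.05, effort=50),
    Candidate("cable", likelihood=0.30, effort=1),
    Candidate("screen", likelihood=0.30, effort=3),
]

for fix in prioritise(fixes):
    print(fix.name)
```

A single ratio is simpler than the ordering described above, so treat it as a starting point and adjust the scoring when experience says otherwise.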
Code
All this applies to code too - isolate your problem in a small area where you can examine it closely.
Partitioning is a key skill in software development. By decoupling components and reducing their reliance on each other, what we’re doing is creating distinct boundaries between them. That helps us a lot when it comes to problem solving, as we can isolate our problems by component and fix them in a single place.
In code, we can also create test harnesses that interact with one or more modules using defined interfaces, which allows us to create minimal test cases that use the least amount of code (hence the least amount to go wrong!) to exercise a component we think might be the problem.
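As a minimal sketch of such a harness, suppose the suspect component is a currency converter that gets its rates from some other module. Every name here (`RateSource`, `Converter`, `FixedRates`) is hypothetical. The point is only that a defined interface lets us unplug the real dependency and plug in a trivial stand-in, just like swapping a cable.

```python
from typing import Protocol


class RateSource(Protocol):
    """The defined interface: the 'socket' the converter plugs into."""

    def rate_for(self, currency: str) -> float: ...


class Converter:
    """The component under suspicion."""

    def __init__(self, rates: RateSource) -> None:
        self.rates = rates

    def convert(self, amount: float, currency: str) -> float:
        return round(amount * self.rates.rate_for(currency), 2)


class FixedRates:
    """Stand-in for the real rate source: no network, no database."""

    def __init__(self, rate: float) -> None:
        self.rate = rate

    def rate_for(self, currency: str) -> float:
        return self.rate


def test_convert_applies_rate() -> None:
    # Minimal test case: only Converter and a stub are involved,
    # so a failure here can only come from Converter itself.
    assert Converter(FixedRates(2.0)).convert(10.0, "EUR") == 20.0


test_convert_applies_rate()
```

If this test passes but the full system still misbehaves, the fault is on the other side of the boundary, in whatever really supplies the rates.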
Good modularisation is one of the most important skills to learn as a developer, if not the most important.
Networks, only a different kind…
So networks have natural boundaries - in graph terms there are edges, and there are vertices. Vertices can be grouped to form subsystems.
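To make that concrete, here is a small sketch that groups vertices into subsystems and then picks out the edges that cross from one subsystem to another, which is where the boundaries are. The component names and the grouping are invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical system: which subsystem each vertex belongs to.
SUBSYSTEM_OF: Dict[str, str] = {
    "web": "frontend",
    "api": "backend",
    "db": "backend",
    "cache": "backend",
}

# Edges are the relationships between vertices.
EDGES: List[Tuple[str, str]] = [
    ("web", "api"),
    ("api", "db"),
    ("api", "cache"),
]


def boundary_edges(
    edges: List[Tuple[str, str]], subsystem_of: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return the edges whose two ends sit in different subsystems."""
    return [(a, b) for a, b in edges if subsystem_of[a] != subsystem_of[b]]


print(boundary_edges(EDGES, SUBSYSTEM_OF))
```

In this made-up system the only boundary edge is the one between `web` and `api`, so that is the natural place to unplug things when the two halves need to be tested separately.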
Analysing entire systems in terms of boundaries and relationships enables us to grasp the whole system, whilst still imposing some order on it.
That's the beginning of systems thinking.