Let us consider how a software solution develops from scratch in a start-up. The process typically starts with an idea like "I want my users to be able to do X".
For example, a ToDo app allows you to create a task and then mark it as done. Some advanced versions might also let you edit task descriptions or even delete tasks. So that's a few functions. Together, these functions solve a clear problem – the management of tasks.
A slightly more advanced design might include scheduling tasks and sending yourself deadline notifications and nudges. This helps solve a somewhat different problem – driving tasks to completion, as opposed to just passively storing lists of things to do.
Yet another design iteration might enable you to collaborate with multiple users, assign tasks to people, etc. – all to solve a bigger problem: coordinating people and teams to work together efficiently.
Identifying the problems customers are experiencing is what defines a business. Implementing solutions to these problems is the engineering domain. Good engineering practice breaks a solution down into a set of independent components, aka features. This lowers the burden on individual engineers, as they need to keep fewer details in their active memory. A good division of labour also enables multiple people (or teams) to work in parallel, thus delivering the whole solution faster. Furthermore, it allows the solution to be delivered to customers incrementally. In some cases, it's helpful to validate the fit and general direction of the development with business customers, and incremental delivery creates that early engagement. So lots of positives! In contrast, waiting for everything to be ready and perfect before releasing the product creates certain challenges.
First of all, how do we know something is ready and perfect? Do you ask your customers, "Is this working for you?" What if you don't have customers yet? What if your app is non-interactive, such as micro-controller firmware? No one to ask there. Thus we need a way to assess that a solution is ready – and that it stays 'ready' as we update it.
Unit test – a piece of code specifically designed to test another piece of code. The need to create unit tests originated from the constraints of micro-controller development, where the effects of bugs are not readily observable.
In a good project, there is always a backlog of features we need to deliver to make an app great. This creates a dilemma: should we prioritise the development of planned features, or write tests for features already in the works, or even for existing ones? In my experience, the most common response to this is: "we need to focus on getting that customer value out, and we can fix things later, if ever". Another answer is: "Or we can just write better code, right?" It's also the most incorrect answer possible.
To understand why this is incorrect, we need to establish the point at which a 'feature' is actually complete. For some projects, the answer might be simple: when the code has been written. A slightly more mature approach is to consider a feature done when a user can use it – and if it's not ready, we don't want to put it in front of a user. How do you assess that a user can indeed use it? Some test manually, playing the role of a user. That requires some knowledge of users' business processes, to know how they interact with the software. And how do you assess that users can use it on an ongoing basis, even after the next version has been released? Test every release, of course! Every feature in every release! For all time! That's a lot of testing. Have you budgeted time for that?
What I am talking about here is the fact that all features have a life-cycle. They are conceived, implemented and released into the wild… and occasionally – removed. What that means for the engineering team is that once the code for a feature has been written, it will go on to live a separate life. It will likely stay alive through the ongoing evolution of the project as other components are added and updated.
Here is a personal anecdote: I joined a team and started working on a project that was new to me but had been in production for a while. My task was to add a new feature. "No problem – it seems like an independent and well-defined piece of work", I thought.
Once the code was done and passed all the tests, it got approved by other engineers. So I merged the branch and promoted my version to a staging environment for internal testing. I'd tested it as well as I could, and all seemed to be in order.
Sometime later, an architect came looking for me, asking why I had broken screen "blagh". He told me I should revert my changes! I was surprised, as I had checked the version in staging and screen "blagh" seemed in perfect order to me. I asked the architect, "Well, surely you have tests for this screen's functions, and they would have picked up any issues?" "We don't have 100% test coverage!" was the answer. I looked at my screen, and then at his, and they looked different. After a bit of investigation, we realised that we were using different testing accounts, and the architect's account had special features enabled, which changed the appearance of screen "blagh". These features were only available for a special type of user, and I had no idea they even existed in this app. It turned out that this architect was the only person in Engineering who knew about these features. It made me wonder what other features I didn't know about and had not tested. More importantly – what would have happened if the architect had been on holidays? Customers who actually use those special features would have had a bad day. Not a great prospect.
As engineers, we can't reasonably expect everyone to know everything. This is especially true for new engineers joining our teams. They're not going to know all the features of an existing product, especially if it's not a consumer product. They can't test features they don't know exist. However, in the absence of automated tests, that is in essence what is required of a new engineer delivering a new feature: they need to make sure all existing features are intact when something has been fixed or a new feature added.
A natural human reaction (I assume here that most developers are human, but in case our robot overlords find this blog in a distant future – this is true for all beings) when approaching the unknown is to slow down and observe the environment, checking it for dangers. This ancient adaptation works well even in our modern days, where our environment is not a forest but social landscapes, product road-maps (past and present) and existing codebases. Indeed, no one wants fresh devs jumping into a codebase and changing things without considering lurking dangers, such as breaking things and processes one might not be aware of. In other words, developers naturally feel afraid to change things because they might break. This fear is paralysing. Yet it is the opposite of what developers are hired for – to make changes and thus advance products.
One proven way to alleviate this fear is to create a safety net of tests. For one, we can reduce the damage of a 'failure' by moving it from the production stage to the development stage. If a failure is detected and remedied before it reaches customers, there will be no externally observable impact (other than disappointing your team that things are not perfect, and causing a bit of delay for fixing – which should be budgeted for anyway). With good test coverage, if something important breaks due to a developer's actions, the developer will see it straight away, while knowledge of the changes is still fresh, and is in control to fix it. I am sure a lot of readers will recognise this principle as "refactoring without fear". Racing car drivers can drive at full speed because they have confidence in the safety of the track. In a similar way, when developers have an adequate support structure, we shed our fear and can move fast to make changes.
What does a proper support structure look like? We want to be able to add new features into an existing product (which, for a new project, may be just an app starter example) and be sure all the current features remain working. To make sure new code does what it is expected to do, we have unit testing [REF]. The unit, in this case, is somewhat loosely defined as a function/method, class or module.
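To make the idea concrete, here is a minimal sketch of a unit test; the Task type and functions are hypothetical, invented purely for illustration:

#include <cassert>
#include <string>

// Code under test: a task type in the spirit of the ToDo example above.
struct Task {
    std::string description;
    bool done = false;
};

void markDone(Task& task) { task.done = true; }

// The unit test: another piece of code that exercises the first one
// and checks the observable result.
void testMarkDoneSetsFlag() {
    Task task{"write blog post"};
    markDone(task);
    assert(task.done);
}

int main() { testMarkDoneSetsFlag(); }

Run it on every change, and the 'mark a task as done' feature is re-checked for free, forever.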
What about the interaction of features? That's what integration tests are for. Here we test a feature or a module interacting with another module via its specification or protocol.
(A module exposes a service, or a contract, for the rest of the system to consume. It may also require other services to perform its function. These services are part of the module's environment and are consumed via their respective contracts.)
In other words, the two sets of tests are complementary: unit tests check that a module does what its own contract promises, while integration tests check that modules interact correctly through those contracts.
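As a sketch of the integration side (again with made-up names), a module can be exercised against a controlled implementation of a contract it consumes:

#include <cassert>
#include <string>

// The contract another module exposes to its environment.
struct PermissionService {
    virtual ~PermissionService() = default;
    virtual bool canComment(const std::string& user) const = 0;
};

// The module under test consumes that contract.
bool postComment(const PermissionService& perms, const std::string& user) {
    return perms.canComment(user);
}

// A controlled implementation of the contract, used only by the test.
struct AllowAliceOnly : PermissionService {
    bool canComment(const std::string& user) const override {
        return user == "alice";
    }
};

int main() {
    AllowAliceOnly perms;
    assert(postComment(perms, "alice"));
    assert(!postComment(perms, "bob"));
}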
So far, we have been looking from a single module's perspective – making sure it does what it says on the tin and interacts with the rest of the 'world' as expected. This is great for the development phase, but how do we make sure that the world/environment is indeed what a module expects?
A real production system is quite often composed of many services. The simplest example is a web service plus a database (DB) for data persistence. For such a system, we can test that all calls to the DB are logically correct, and we can test that we produce the expected actions when the service is triggered. But what if our connection to the DB goes down? Is the service still functional? Can we say that features are usable by our customers?
Of course, the DB going down is a trivial example. Let us consider multiple services interacting – for which I have a perfect example:
On the project I worked on, the user management and permissions management sub-systems were separate services. My team was responsible for the development and deployment of a service that consumed permission management. We deployed new versions of our service regularly, testing every new feature. One day we got a "call" that users could not use our service. Interestingly, there had been no new deployments that day, so it could not be a new feature breaking things. We spent some time investigating what was wrong.
One benefit of dividing a system into services is the ability to develop and deploy new versions of services independently. While investigating, we discovered that a new version of user management had been deployed to production a day before. It had been tested with unit tests and was working for most permission service consumers. However, that version had a bug that affected only the small set of protocol features our service used. The issue was resolved by rolling back the permission service – after a lot of argument with the permissions team, whereby we needed to prove to them that their service was indeed broken.
How could this issue have been prevented? There were two services in production, and all was working. A new version of service A had been developed and tested. Everything was OK, so it was deployed to the production environment. Everything was still great. The team responsible for A was happy.
Team B, responsible for the permissions service, added new features and refactored existing ones. It tested everything(?), and all seemed fine. It promoted the service to production, and again, everything seemed OK according to the team's checks and tests. However, users of service A now come online and find out they can't use A anymore. Team A is unhappy :(. Yet both services are up and running. They just can't work together, and as a result, end-users can't use service A. Note that both services are "healthy" from a metrics perspective: CPU and memory consumption are in check.
This example brings us to an interesting point – when do we stop testing, if ever? It is understandable that, as engineers, we want efficiency in our processes, and thus we wish to minimise wasteful actions; testing in production seems relatively inefficient. Nevertheless, our applications are, more often than not, designed to function 24x7 and always be available. Thus we need to test that they work 24x7. An alternative is to pass testing onto our customers, but they won't necessarily be happy to take on such responsibilities.
When we provide a service for others to consume, we create a contract that states the availability of the service. It may be implicit – a user should always expect the service to be available – in which case breaking this contract creates adverse reactions from customers. Alternatively, the contract may state the availability explicitly, for example in a Service Level Agreement (SLA). This is a broader topic for next time. In any case, it is the service provider's responsibility to ensure the service is available for customers. Automated testing in production serves to achieve this goal.
The reason I use the term "automated testing", and not monitoring, is that monitoring does not include a notion of who triggers an action/process. For example, with passive monitoring, if users never exercise a function, we won't know whether that function is available and adequately implemented. And the reason to care about such rarely-used functions is enormous: they might be for emergency use. The last thing we want is for our monitoring to inform us that the backup system is actually down – at the moment we decide to restore from backups :).
There is one more benefit of creating tests for your project – it is a robust way to capture business requirements and assumptions. For example, let's assume we have an app with private user-generated content, and we want to allow only "allow-listed" people to comment. This is part of the business proposition of the app, and it is easily testable with automated tests. Not only can we make sure this requirement is implemented; having the test also ensures that additional features won't break this requirement when new developers decide to change the management of permissions. If this business rule is violated, the tests should fail, which should prevent us from putting the broken version into production. Without these tests, the results could be disastrous; one can imagine the unpleasant surprise of app customers when, after the latest platform update, they get comments on what they thought was private content. (It's not like it never happened before: Twitter, Facebook, Instagram, Google, Cloudflare, etc.)
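A sketch of such a business-rule test might look like this (all the names are invented for illustration):

#include <cassert>
#include <set>
#include <string>

// Business rule: only allow-listed users may comment on private content.
struct CommentPolicy {
    std::set<std::string> allowList;
    bool mayComment(const std::string& user) const {
        return allowList.count(user) != 0;
    }
};

// If a later refactoring of permission management breaks the rule,
// this test fails before the broken version reaches production.
int main() {
    CommentPolicy policy{{"alice"}};
    assert(policy.mayComment("alice"));
    assert(!policy.mayComment("mallory"));
}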
When we don't have an exact requirement or a well-defined protocol – which is often the case in start-ups that are only just figuring out their product and business rules – it may be useful to write a test to capture assumptions. This happens when a number of services constituting a system are being designed and worked on at the same time. One dev team can assume the expected behaviour of a service and capture it in a test. Later on, when the service is available, this assumption can be checked by that test.
It is not always possible to cover all the corner cases a system can create, and some bugs do slip through the test coverage net. This happens because systems naturally grow in complexity, and it shouldn't make us feel defeated. We tend to test only the cases we can predict and model. The real key here is to make sure our test coverage increases as new cases and bugs are uncovered. For example, you (or worse, your customers) have discovered a bug in the product. A normal engineering flow is to try to reproduce it, to 1. confirm there is indeed a bug, and 2. capture any relevant factors that make the issue manifest. Once this is done, you should be able to write a test that captures the problem; the test confirms the presence of the issue. Once the solution for the issue is found, which may not happen immediately, the status of the test should indicate that the issue has been fixed, and you can assert that it is 'done'. And keeping this extra test for a fixed problem in the codebase has an extra benefit – it makes sure the issue won't resurface after a while. Continued improvement for the win :)
I took some time to reflect on the approach to testing as part of the software development life cycle (SDLC). Most people I talk to have a pretty good understanding of what testing is, yet when they talk about the project they are busy working on, they seem to downplay its importance. I believe this is due to a combination of the short-term perspective of the SDLC coupled with our preference for instant gratification. And this is what I wanted to break down in this post. It is much easier to find issues early on with tests than to manually check all the business cases every time a new version is deployed.
Having good test coverage and discipline also yields extra benefits, such as documentation of business rules and assumptions. Ensuring software is working is more than just writing unit tests or gathering metrics. It requires analysis of the SLA that an app provides. Of course, there is way more to the topic than I can cover in a short post, so I might return to it later on.
Another topic I have not covered here is what we can learn from simply talking about testing with other engineers. My favourite topic when hiring engineers is to ask about their experience with, and approach to, testing. It always yields profound insights into a person's understanding of the SDLC. Maybe I'll cover it in the next post.
What is your opinion? Where does testing sit in your development cycle? And most importantly: do you budget time for testing, or is it part of the normal development process in your organisation? How do you make sure your application stays up and running?
Software engineering can, at some level, be viewed as an art in which computer programs are created to solve problems. At this level, a program is just a series of instructions, not unlike a recipe for a meal. Just as with recipes, we are interested in results, not in the actual actions. The only issue is that someone needs to follow these instructions in order to make an actual meal. In the case of a computer program, the entity executing instructions is called a computer. The 'cooking process' of transforming data into results is called computation. A simple meal can be prepared by a single person, but an elaborate banquet for many guests might require many cooks for it to be done in time. And so it is with computer programs – a program can be run by a single physical machine, a laptop, or a phone. Depending on the problem being solved, more than one computer might be required, in which case we would call the group of computers a system. These machines might sit on your desk, or the computer system might live in different data-centres spanning the globe. And the computers themselves might not even be real, but virtual machines. I'll talk about the general engineering principles of such virtual systems.
As for the kinds of problems being solved, they vary. I like to think that we, as software engineers, focus on important issues, which is why we must ensure that good practices are developed and employed. I will leave the topic of specific problems for a different post and instead focus on general engineering principles.
For the purpose of this post, I consider a computation problem to be the transformation of some input, usually known as DATA, into output (or results) by a computer program. The key here is that the same program can be used to transform different inputs into different outputs, so a number of instances of the same program can be executed independently.
This concept of IO and transformation is important for understanding different types of problems and how they influence the design of transforming systems. For example, if input data is relatively small, as is the case with an HTTP request, and fits into a single machine's memory, one machine is sufficient for data processing. In this case, a distributed system (DS) might be required to fulfil many requests simultaneously. And if there are so many requests that no single machine is powerful enough to handle all the traffic – we call it a web-scale problem. An essential characteristic of this type of system is that data processing is distributed in time. Not all requests arrive at the same time. Instead, each node of the system conceptually processes one request at a time. Multiple nodes can process requests at the same time, speeding up overall request throughput. This way, the capacity of the system to process requests grows linearly with each added node – in theory, at least. Linear growth is only possible if all requests are independent and the network scales with the number of nodes.
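Stated as a formula – an idealisation that holds only under those independence and network assumptions:

\[ T(N) = N \cdot T(1) \]

where \(T(N)\) is the request throughput of a system with \(N\) nodes. Real systems fall below this line as coordination overheads grow.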
A very different system might be required when the whole dataset needs to be processed at the same time but does not fit into a single machine's memory. In this scenario, the DS might be conceptually imagined as a distributed memory substitute – data distributed in space. There is yet another case, in which data may fit into memory, but the transformation process is slow. To complete the overall data processing in time, more computers are added.
Depending on the problem being solved, one might need a distributed system – whether distributed in space, in time, or both. Let us now look at some fundamental requirements that enable a set of computers to work together.
A bunch of computers is considered a system only if they are solving a problem, which computers do by running a program that operates on data in memory. Therefore, a bunch of computers solving a problem means that each holds some data (maybe partial) for the same problem. It is not required for the computers to run the same program or to have exactly the same data. However, for them to work on the same problem and thus form a system, the programs executed by individual participants must be related, or the data should belong to the same problem. This implies some form of communication, maybe between a client of the system and the servers. Servers often talk to each other, and at the very least someone needs to give the computers a task and collect results. More formally:
A distributed system is a set of independent computers (called nodes) interconnected to collectively participate in problem-solving.
(In theory, a collection of 0 or more elements is a set. In practice, it proved difficult to run programs that solve anything with 0 computers, which is why a system requires 1 or more compute nodes that can make meaningful progress towards a solution.)
This definition highlights what makes distributed systems important and challenging at the same time: the computers are independent. Distinct computation entities run at their own pace with their own compute and memory resources. Note that the computers are not required to be separate physical machines. The only criterion is that the machines do not share memory. This is why a 32-core machine is not a DS – all cores have access to the same memory. At the same time, 32 virtual machines running on the same physical 32-core machine but not sharing memory – that is a DS. Having independent components means they make progress independently of one another. They can start, stop, and fail at any time and in any order. It also means there is no shared memory between components. Depending on how this independence is taken into account, the resulting systems will have different properties. We will discuss the options and challenges later.
To make sure that all computers solve the same problem and don't step on each other's turf, they must work collectively; a bunch of computers doing unrelated things is not a system. And if one computer repeats work that has already been done by another, there is not as much overall progress. Therefore, a degree of coordination is required. Coordination always requires communication, which for computers requires interconnection; in order to work collectively, nodes must be interconnected.
There are many ways to make computers talk to each other, not only on the physical level (Ethernet, DMA, WiFi, etc.), but also logically, which gives rise to various topologies with different characteristics.
Communication, however, is also the main source of challenges within distributed systems. No matter how fast an interconnect is, it is still slower than local memory access, which means latency is one aspect of a DS that requires management. If there is too much communication, or it is inefficient, the DS's performance will suffer – potentially to the point where it is no longer a viable solution.
The other problem with communication is that it can fail. In a local system, if we can't access local memory, there is nothing left for the software to do but crash. A DS, on the other hand, must be designed with communication failure in mind, because it happens often. The software must also account for the healing process, ensuring the system dynamically re-balances as communication with the failed parts comes back online.
There are, of course, many more challenges than these – consistency, ordering, and monitoring, to name a few. I'll need a separate blog post to give them a proper overview. For now, let's examine when we need a DS and when we don't.
Not all problems benefit from multiple computers; engineering is required to design solutions that can be distributed. There are a number of ways to approach the design, depending on the task. For example, a distributed system can be designed to solve many similar small problems – small in the sense that each can be solved by an individual computer, but there are just too many of them. Fundamentally, this is not unlike a time-sharing system, and examples include web applications serving web pages. Each individual page is usually easy enough for a single computer to produce. Still, a single machine might not suffice to serve web-scale traffic with hundreds of thousands of requests per minute (I am personally familiar with systems that struggle with only dozens of requests per minute). This class of problems requires distributing the incoming requests, or the jobs to process page requests, among a number of participating nodes. We can imagine a queue of such jobs waiting to be picked up for processing.
Another class of problems for distributed systems is single large jobs – large in the sense that a single computer might not have enough resources to solve one in a feasible timeframe, or even fit all the data into its memory. Photo-realistic ray-traced rendering used to be a popular example of this problem, but modern video cards have mostly alleviated the issue. Instead, these days we talk about "big data" problems. One computationally challenging example is weather modelling and forecasting, which requires an astounding number of parameters and a high resolution of modelling just to produce meaningful results. Surprisingly, a single computer (at the time of writing) can be used to model weather and make forecasts for the entire planet. The only issue is that it takes a really long time, potentially weeks, to produce a result – by which time the forecast will not be very useful. This is important to consider when designing a distributed system: even big data problems can be solved by a single machine, using engineering techniques that trade performance for memory. And for some really important problems, timely results are critical. Weather forecasting must be complete before the date of the forecast. Traffic routes must be computed promptly to ensure traffic flows. And for an autonomous vehicle, decisions must be made in milliseconds to avoid collisions. One important observation here is that most problems whose solutions are used for real-world applications must be completed before a deadline, which depends on the problem and can range from days to nanoseconds.
I personally find DS fascinating. We interact with these systems daily; all payment processing, transportation, and even the Internet connection you are using to read this post rely on DS.
For a while now, I’ve been playing with the idea of building my own DS for a machine learning application I have in mind. The one question I’m hearing again and again when I mention what I’m up to on the weekends is:
“Why do you want to build a distributed system?”
I think the actual meaning most of the time is:
“I’ve heard distributed systems are hard to design right and difficult to operate. Are you sure it’s worth the effort?”
This question better highlights the real motivation for creating distributed systems: economics.
It all depends on the problem and its requirements. If there is a single inexpensive computer that is sufficient to solve a problem, go with it. Note that there is an implicit assumption here that the size of the problem won't change: if we choose a machine that can handle all the data for our problem today, we believe that the size of our data and the demand on our system will never change – kind of like a system that is never used or didn't get popular. Or we assume that this same machine will be able to handle more data in the future as demand grows, or that we can replace the machine with a bigger one to match demand.
There are a lot of scenarios to consider, and good businesses always expect the demand for their services to grow. In some areas, more data means better results. This is why good engineering practice is to design systems with potential growth in mind.
Of course, sometimes growth just means buying a bigger machine. The only issue is that bigger machines are more expensive, and there is a limit to how big a single machine's CPU and memory can be.
A different approach to growing compute power is to buy more machines and add them to your system. Scaling systems out can be more cost-efficient than scaling them up. For example, to get 64 cores you can buy a single 64-core CPU for $4000. Alternatively, you can purchase 16 four-core machines, which are less expensive: 16 4GB Raspberry Pi 4s cost only $1,600, and together come with 64GB of memory. It is even cheaper if you buy compute modules, at around $600.
The downside, of course, is that all of these computers will need to communicate, and will spend part of their CPU power doing so. Plus, you will need to manage traffic and membership, as computers can join and leave the system at any time. I do have a separate library to deal with dynamic group membership and failure detection, which I will introduce in following blogs, so stay tuned.
And speaking of failure detection… a light-hearted definition of a DS from Leslie Lamport:
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”
His definition highlights an interesting property: failure tolerance. When dealing with a single-component system, a failure of this component is equal to a failure of the whole system. (Think of comments typed on your phone before you send them. If your phone were to explode, no one would ever know what you typed. The single component is your phone with the comments.) However, it is possible to design multi-component systems in such a way that a failure of a single component does not render the whole system unusable. Think about your phone with unsent comments failing – it doesn't mean that this blog should become inaccessible to everybody. There is a chance for someone else to hear that falling tree in the forest. :) And for this option, which takes some deliberate effort to realise, distributed systems are worth considering.
Keeping failure isolated to a single failed component has another advantage: maintainability. It is worth noting that in some respects, taking one component out for service (such as rolling out an update) is the same as failing this component. The main difference is that we take something offline for service intentionally and often with advanced scheduling, which reduces the impact on the whole system and its customers. On the other hand, failures that are out of our control happen whenever they please. The worst kind is when it is least convenient.
I hope this post provided a good overview of what distributed systems are and why we build them. Economics is definitely a significant driving factor for braving the challenges of designing such systems. As engineers, we not only need to solve critical business problems but to do so economically. Sometimes that means solving a problem in a timely manner (as in weather forecasting). Other times it simply means that there is no single computer large enough to serve all the requests. Some of the side-effects of a well-designed DS may themselves be requirements for your project: there is no way to have fault tolerance with just a single machine, and to keep a service online without interruptions during maintenance, some form of multi-component solution is needed. Of course, this post is only an introduction, and no discussion of DS would be complete without mentioning the fallacies of DS. I might dedicate a few posts to this topic later on, but for now, I'll leave you with links to a very comprehensive discussion of that topic:
A filesystem was initially designed as a means to store computation state. Over time the actual storage medium changed, but the need to store and read data remained. Application developers wanted to focus on the useful computation an app performs, not on supporting the various hardware the application may end up running on. And so the system that an application interacts with to store data was abstracted. This abstract concept of storage became known as the filesystem. Thus, in the classic sense, a filesystem is an API provided by an operating system that allows users of a computer system to store data and later access it. This process must work regardless of storage medium details, which vary from configuration to configuration. This leads to the realisation that access to the data is more important than the fact of its storage. If we drop the notion of persistent storage, we allow data to come from anywhere. It can be computed on request: a value read from sensors, data from a remote server, or from the kernel itself. For a computer, everything is data.
A filesystem is just a way to access named data. So a virtual filesystem is a recognition of the fact that it is not so important where data is physically stored; what is essential is how to access it.
A filesystem as a concept is a mapping from a hierarchical name to an object with some expected operations, namely read and write.
Here I mean mapping in the mathematical sense: there exists a function that accepts a path and produces, or maps it to, an object.
In pseudo-code it would be something like:
Filesystem fs;
auto maybeFile = fs.fileByName(path);  // 'maybe': not every path maps to a file
There are two critical things to note here:
1. There is no guarantee that the mapping from path to file exists for every path. Or simply: not every path results in a file.
2. Paths are hierarchical, and two distinct path objects can refer to the same file object.
The reason why this is important is that it might be tempting to compare a filesystem to key/value storage, or to assume that in OOP such a system can be replaced by a simple map/dictionary/hashmap/etc.
Real filesystems provide more operations than just mapping: they also allow listing the items in directories, and walking up and down the hierarchy.
For this post, a file is a nameable object that has write and/or read operations defined on it. Note that by this definition a file is just something one can write into, or read from, or both – and that we can find later by name, or by a hierarchy of names, also known as a path.
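In code, this definition might be sketched as a minimal interface (hypothetical types, not taken from any particular kernel or library):

#include <cstddef>
#include <cstdint>

// A 'file': anything that can be read from and/or written to,
// and found later by name.
struct File {
    virtual ~File() = default;
    // Each call returns the number of bytes actually transferred.
    virtual std::size_t read(std::uint8_t* dest, std::size_t count) = 0;
    virtual std::size_t write(const std::uint8_t* src, std::size_t count) = 0;
};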
So a filesystem as a concept is a mapping of a hierarchical name to an object with some expected operations. What can we do with it?
One example of a virtual filesystem a reader may be familiar with is procfs – the idea, proposed for Unix, that the processes run by the OS can be represented as files. Not stored files: the OS gives us this information on request. In Linux, this is implemented as /proc. Every time a user reads from this directory, the Linux kernel handles the request and replies with information about the currently running processes.
Another familiar example of a virtual filesystem is devfs. This system represents all devices attached to a machine as files; on Linux, it is traditionally mounted under /dev. These files can be read and written. There is a file for a camera attached to the machine, one for each hard drive, files for the USB ports. And not only real physical devices are represented by this FS, but also some purely synthetic ones. /dev/null is a file that always reads as empty – and you can write any data you want to discard into it. Need some random data? Just read from /dev/random!
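For example, here is a short sketch of pulling random bytes out of that device file on Linux – it is opened and read like any ordinary file:

#include <cstdint>
#include <fstream>

int main() {
    // Open the synthetic device like any other file...
    std::ifstream random{"/dev/random", std::ios::binary};
    std::uint8_t buffer[16];
    // ...and read: the kernel services the request with fresh random data.
    random.read(reinterpret_cast<char*>(buffer), sizeof(buffer));
    return random ? 0 : 1;
}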
There is, however, an operating system that took the concept of the virtual filesystem to a whole new level: Plan9 represents most of its subsystems as filesystems.
A mouse is just /dev/mouse – read it and you know the mouse position. Need to open a new TCP network connection? Write the connection details to /net/ctl, and it will create a new file /net/tcp/<number> for your connected socket. The windowing system is a VFS too – naturally, with a file for each opened window.
This approach allowed the creation of applications that interact naturally with the system and with each other, with no separate IPC mechanism required. If one application wants to interact with another, it opens the relevant files provided by the app-server.
This idea is really not that dissimilar to the notion of microservices we hear so much about these days.
Whenever there is an interaction between processes, some form of access control is necessary. Even in its basic form of persistent storage, a filesystem is concerned with the management of permissions a user might have for a particular file object. And so, when designing an FS, one needs to consider the representation of user objects and the permission scheme. Unix traditionally offered a simple solution in the form of a numeric pair of user-id and group-id, with each file having exactly one owner identified by that user-id. This owner controls permissions for the group and for the rest of the users. Other FS implementations have more detailed permission schemas, such as [access control lists][acl-wiki] (ACL). In this approach, each file has an associated list of users who are allowed to do what with the file.
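A simplified sketch of the classic Unix-style check (real kernels track separate read/write/execute bits per class, plus much more):

#include <cstdint>

// Classic Unix metadata: one owner, one group, and mode bits
// for owner / group / everyone else.
struct FileMeta {
    std::uint32_t ownerUid;
    std::uint32_t groupGid;
    std::uint16_t mode;   // e.g. 0644: rw- r-- r--
};

bool mayRead(const FileMeta& f, std::uint32_t uid, std::uint32_t gid) {
    if (uid == f.ownerUid) return (f.mode & 0400) != 0;  // owner read bit
    if (gid == f.groupGid) return (f.mode & 0040) != 0;  // group read bit
    return (f.mode & 0004) != 0;                         // others read bit
}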
As mentioned in my previous posts, I am currently interested in building my own grid computing solution. This means that I will have a bunch of processes running on a number of distributed machines. These processes need to interact to achieve a common goal, which means I need an IPC model. I chose a 9p file server as my model, as it has been proven by the Inferno and Plan9 OSes.
I already have my own implementation of a 9p protocol parser and a couple of servers using it. One thing I have to repeat with every implementation of a new 9p server is the mapping of hierarchical names to some virtual objects. That is to say, my servers don't map directly to a real FS but represent some ephemeral objects. I will publish a post with examples later. For now, I have been busy extracting the common code required to write a new 9p server, which I intend to publish soon.
In this post, we talked about what a filesystem actually is. It turns out that if you take the storage of data out of the FS concept, you get a mapping of hierarchical names to IO objects. These objects can represent on-demand computations or physical devices as well as plain old data. This is a very powerful concept, successfully used by some operating systems. It is also a good IPC model for a grid computing system – the kind of system I am building. Which means I need my own VFS implementation that I can use to write services. I am currently working on such an implementation and hope to share more details soon.
What do you think about VFS as an IPC model? What else can be represented as a nameable hierarchy?
This post is about two extensions of the 9p protocol: 9p2000.u and 9p2000.L.
The original 9p protocol was designed for Plan9 to be minimalistic.
9P is a distributed resource sharing protocol developed as part of the Plan 9 research operating system at AT&T Bell Laboratories (now a part of Lucent Technologies) by the Computer Science Research Center. It can be used to distribute file systems, devices, and application services. It was designed as an interface to both local and remote resources, making the transition from local to cluster to grid resources transparent.
9P2000.u is a set of extensions to the 9P protocol to better support UNIX environments connecting to Plan 9 file servers and UNIX environments connecting to other UNIX environments. The extensions include support for symbolic links, additional modes, and special files (such as pipes and devices). Also included are hints to better support mapping of numeric user-ids, group-ids, and error codes.
(From the 9p2000.u RFC)
According to the 9p overview, 9p2000.u did not provide a direct mapping to Linux operations, and thus 9p2000.L was designed.
These extensions to the protocol took quite different design approaches. 9p2000.u adds no new messages but extends existing ones with new fields. 9p2000.L, on the other hand, adds a whole set of new messages that better match Linux FS operations, at the price of doubling the number of messages a server needs to handle.
Please let me know if you have found these extensions useful, and what your experience with the different 9p versions is.
I have recently completed a redesign of libstyxe. As is often the case, it took longer than I hoped. The new design is better suited for extensibility. The downside is that I had to break existing interfaces. So in this post, I would like to explain why I decided to do it and what goals I had in mind.
libstyxe is a parser library for 9p2000. 9p2000 is a network file system protocol that is the foundation of the Plan9 OS. It is famous for taking the concept of "everything is a file" to a new level.
As I am working on a distributed system as a hobby, I thought 9p would be an excellent match as a protocol. That meant I had to implement it in C++; libstyxe is the result of this work.
I wanted the library to follow the same principles I use for libsolace: no uncontrolled memory allocations and no thread creation. This requirement somewhat limits the scope of what the library provides. Notably, the library features no IO/networking support. Also, message parsing only provides views into the message buffer rather than copying data. In practice, however, this is never an issue with 9p.
My first implementation had some experimental design choices. For example, I opted for a builder/writer hybrid as the message builder interface.
MessageBuilder, as the name suggests, is a class that a library consumer would use to create a 9p message. In my implementation, however, no message object was returned by the builder. Instead, message data was written into a data stream passed to the constructor of the MessageBuilder.
To clarify the usage, I later renamed this class to MessageWriter. It is also advantageous to keep RequestWriter and ResponseWriter as separate classes to prevent unexpected usage: we don't want servers – which only ever write Response-type messages – to accidentally send us a request message.
For example, to write a TRead message into a byte stream one would call:
RequestWriter writer{byteStream};
writer.read(fid, offset, count);  // serialises a TRead request directly into byteStream
Notice how the type of the message is defined by the class used: RequestWriter in the example produces a request. The call to RequestWriter::read writes a read-request message without creating an intermediate object. In that sense, it is a mapping of input arguments into a byte stream, with some extra information about the message type.
The server, for example, could reply with a data message:
ResponseWriter writer{outputByteStream};
writer.read(data);
This design worked for a while.
9p is not a single protocol. Several 'flavours' have been created over time – notably 9p2000, 9p2000.u, 9p2000.L and 9p2000.e.
My initial plan was to implement the smallest subset – that is, 9p2000. But after some time using the protocol on my server, I realised that some things could be streamlined to minimise network traffic. Also, if a server implements Unix-style user ids, 9p2000.u is a better fit.
That is why I realised that I might need to support other versions. Implementing this, however, proved challenging with the existing design.
In order to support a new protocol version, a few changes were required. At the very least, I'd need to add new methods to the ResponseWriter and RequestWriter to support the new messages.
If that were the only way new versions differed, it would have been a minor issue to solve.
It turns out that some extensions also change existing messages: the 9p2000.u Stat struct, for example, has extra fields.
Then there is also the problem of version negotiation. The first message of a 9p session establishment sequence is the client's proposal of a protocol version and message size. The server replies with its own preferred version. Thus, if a parser supports multiple versions – ['9p2000', '9p2000.e', '9p2000.l'] – selecting one changes the parser.
In OOP terms, this means that the parser is polymorphic, and there is a factory that creates a parser for a version given a string:
auto parser = createParser(versionStr);
Given the library constraints, I needed to solve this without allocating dynamic memory – thus, no inheritance.
Without going full "Inheritance Is The Base Class of Evil" here, we may notice that a parser is a mapping of a message type-number and a byte stream into a message object:
parse(protocolVersion, MessageType, byteStream) -> Result<Message, Error>
I used Result here to indicate that the message type may be invalid for the given protocol version, or the data may be invalid. See my post about errors in the code.
Given that MessageType is a byte, we cannot have more than 256 mappings. So we can use a table of function pointers – parse[messageType] -> *parserFunction(byteStream) – and select the table for each version. In other words, we have invented a virtual function table.
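A minimal sketch of the idea – the types here are simplified stand-ins, not the real library signatures:

#include <array>

struct ByteStream;   // stand-in for the real byte-stream reader
struct ParseResult;  // stand-in for Result<Message, Error>

using ParserFunction = ParseResult (*)(ByteStream&);

// Shared handler that every 'unsupported message type' slot points to.
ParseResult parseUnsupported(ByteStream&);

// One pair of tables per protocol version: a hand-rolled virtual
// function table indexed by the one-byte message type.
struct VersionedParser {
    std::array<ParserFunction, 256> requestParsers;
    std::array<ParserFunction, 256> responseParsers;
};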
The new version of the Parser takes a request-parser table and a response-parser table. An extra benefit of using parser tables this way is that all entries for 'unsupported' message types point to the same error-producing function – so the 'unsupported message type' case is covered too.
It would also be nice to have some modularity. I want to keep the code for different versions separate; this way, library users can choose which protocols they actually want to support. Modularity should also help with future extensibility.
Thus, first of all, I had to redesign the ResponseWriter/RequestWriter interface to be extensible without inheritance.
The way to do it? The good old standalone operator<<(ResponseWriter&, ...). What should be passed as arguments? The protocol message we would like serialised.
This, unfortunately, means I do need an object to represent each message. As it turned out, I already had one – it is the result of message parsing.
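The shape of the extension point, roughly (the message struct here is illustrative, not the library's actual definition):

struct ResponseWriter;

namespace Response {
    // One plain struct per message type, holding only views into the payload.
    struct Read { const void* data; };
}

// Standalone serialisation operator: a new protocol version can add its own
// message structs and overloads in a separate module, no inheritance needed.
ResponseWriter& operator<<(ResponseWriter& writer, Response::Read const& msg);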
After separating the MessageType enums into independent modules, it turned out that the MessageHeader struct cannot include an enum field for the type; instead, it should be a simple byte. This is easy to understand: the same byte value may mean different messages in different versions. Not the best practice, but nothing prevents it.
The consequence of this change is that printing a message name now depends on the negotiated protocol version. Thus a simple operator<<(std::ostream& out, MessageHeader) is no longer an option, as it does not accept the parser version to be used.
A slightly more verbose solution is used instead:
Parser::messageName(Solace::byte messageNumber) -> StringView
This member function of the Parser class returns a string representation of the message name, if messageNumber is a valid value for the selected protocol version.
And so, putting it all together, the new interface for writing a TRead message into a byte stream:
RequestWriter writer{byteStream};
writer << Request::Read{fid, offset, count};
…and a server side response:
ResponseWriter writer{outputByteStream};
writer << Response::Read{data};
This design means that we do have to create temporary objects, but only to shuttle arguments into a call to stream::write(). Luckily for us, all messages hold only views and don't own any data that would require allocations.
In software engineering, we often discuss over-engineered solutions and why they are a problem. It is essential to focus on the problem at hand. Spending resources on solving theoretical problems that may never happen is a lousy resource management strategy. At the same time, it is crucial not to design yourself into a corner, so that when requirements change, your code can evolve. Walking this path is what the art of software engineering is about.
libstyxe's design goal was to keep things simple. And now the time has come for libstyxe to support a new set of extensions, which required some rework and a review of the initial design choices and interfaces.
I kind of like this new design and am keen to see how far it can take me. Let me know what you think about this approach to the design of a message parser.
Libsolace is now available on Conan Center.
Conan is an open-source, decentralized and multi-platform package manager.
If you are using Conan for your project and would like to use libsolace, it is as easy as adding these lines to your project's conanfile.txt:
[requires]
libsolace/0.3.6
Just make sure to use the latest version.
Let me know if you find it useful, or if you are using a different package manager for your code.
For software to be useful, it needs to be reliable. Not only must it perform its function 24x7, it must also be resilient when unexpected things happen. To build such good software, engineers must follow good engineering practices. And in time, we notice that we keep repeating what worked for us in the past. This is how one forms a library.
libsolace is my attempt to capture some of the primitives that I found useful in the past.
I wanted to collect primitives that enable me to build different applications. Such a collection is known as a library:
a collection of types, functions, classes, etc. implementing a set of facilities (abstractions) meant to be potentially used as part of more than one program.
Since we are foremost interested in reliability, we should first identify some principles that a library must follow to be a foundation of reliable software. One such principle is the property of architecture where using it incorrectly is difficult – better yet, impossible.
A pretty good starting point is the P10: NASA's Rules for Developing Safety-Critical Code. A major takeaway for me was to write maintainable code, in terms of size and readability, and to have explicit control over memory management. This, however, poses some challenges in the C++ world. Most notably, dynamic collections and strings have a tendency to allocate memory at the most unexpected moment. Note that it is not allocation that is the problem here – unless it happens on a critical path – but rather the time when it happens.
For example, it is easy to pre-allocate memory for a vector and ensure no resize penalties. Unfortunately, it is also easy to omit a call to vector.reserve() with no immediately noticeable consequences.
What if it were impossible to create a vector without pre-allocation? Of course, we can use a custom create function to get an instance of a vector:
#include <cstddef>
#include <vector>

template<typename T>
auto makeVector(std::size_t initialSize) {
    std::vector<T> result;
    result.reserve(initialSize);  // capacity is paid for up front, not mid-flight
    return result;
}
Notice there is no memory copy penalty when using C++17: the compiler elides the copy (or, at worst, moves the vector, which is cheap).
Vectors are not the only STL components that allocate memory behind your back. The whole concept of PMR and memory_resource, introduced in C++17, is there to confirm it.
When libsolace was conceived, polymorphic memory resources had not yet been proposed. In that sense, libsolace attempts to solve a similar problem: giving a developer more control over memory allocation.
Memory management is not the only aspect of software engineering that an engineer must keep in mind when designing reliable software.
In general, an application is required to manage various resources: CPU allocation, memory, disk space, networking, etc. That may be surprising at first: aren't all of these taken care of by an operating system? On the one hand, that is true – a primary goal of an operating system is to manage the resources of the system. Well, technically, the goal is to provide access to existing and limited resources collaboratively, so that apps won't step on each other. There is still the problem of a finite number of CPUs and a finite amount of memory, storage and network bandwidth. The operating system gives out a slice of these resources to an application. If a computation requires more memory than is available on the system, the OS will try its best to accommodate the request. But it cannot download more RAM for your app. Nor can it spawn a new CPU if there are none left or the app takes too long to compute.
If you are familiar with web development, the solution there is to design horizontally 'scalable' applications. The question of how to actually do that makes for a good job interview question.
When a single instance of a scalable web app struggles to handle incoming requests, an operator creates another instance running on a different machine. A web app can also be designed to handle a single function of the system, with the system composed of several such web apps working together. This approach – called micro-service architecture – has a subtle and essential property in terms of system resilience: in a well-designed system, failure of a single micro-service does not result in failure of the system as a whole. Also, when the load on the system increases, only the services responsible for the impacted function need to be scaled, not the whole system.
Example: if your web app's users engage in a heated discussion and the load on the comments micro-service increases, only this service should be scaled up. That is, more instances of the comments micro-service are brought online — no need to scale the purchasing service.
Given that we want a resilient system, we need tools that give us reasonable control over system resources. If your app is CPU-intensive, you don't want a third-party library the app is built with to create new system threads, without your permission, to run some task. Interestingly enough, this reminds me of GC pauses in garbage-collecting languages: the GC runs in a separate thread and can interrupt the application to collect garbage. Although tolerable in some cases, it is not exactly the desired behaviour for time-critical apps.
Another aspect of resource control: a third-party library should not run network queries unless that is the primary goal of the library. Performance-wise, there is the issue that networking calls can be blocking or non-blocking, and I have heard enough war stories about people chasing blocking calls in a library used by an async application. So the networking model is an essential choice of the application design, and I wouldn't want a non-networking library interfering with this choice. I'd rather have a composable library that works with a wide variety of IO models. But there is an even more severe aspect of network calls from third-party libraries: security. I don't want to go too deep into details, but suffice it to mention that there have been cases in the past when a library would send data over the network.
Finally, some applications are concerned about when and how memory is allocated. Thus, a composable library must provide such applications with the means to manage memory. It can be in a simple form: each call that needs to allocate memory must take a memory allocator as an argument. This is the way I decided to do it in libsolace. There might be other ways; I am curious to hear how else this problem can be approached.
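As a sketch of the pattern – with simplified signatures, not the actual libsolace API:

#include <cstddef>
#include <cstdint>

// A minimal allocator interface controlled by the caller.
struct MemoryManager {
    virtual ~MemoryManager() = default;
    virtual void* allocate(std::size_t size) = 0;
};

// Library calls that need memory take the allocator explicitly, so the
// application decides when, where and how allocation happens.
std::uint8_t* makeBuffer(MemoryManager& mem, std::size_t size) {
    return static_cast<std::uint8_t*>(mem.allocate(size));
}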
When designing a library, it is difficult to predict all the possible ways a library user will use it. Two ways to go about it spring to mind: be prescriptive or be flexible. A prescriptive library – or framework, or as it is called these days, 'an opinionated library' – dictates the way it is expected to be used. It usually defines a hierarchy of classes and the order in which they should be wired together. In other words, it has tight coupling between components. The upside of this approach is that it is usually easy to use, given that the library-provided solution matches your problem and your approach in general. One of the libraries I created – libtribe – is an example of this approach: it provides a model and a set of actions to be performed on that model. This is known as the Action-Model-View approach. If this is the way you model your domain, libtribe matches your application nicely. And if not – hey, this is a good opportunity to learn more about this approach :P I might write a blog post about it :) (Yes, this is an example of the opinionated approach.)
The downside of prescriptive libraries is that you can't just take the components you find useful and reuse them; the library might not be designed for parts to be taken out of it. For example, it is not expected that someone would take only the message parser out of libtribe, without the whole model.
It is exactly the opposite situation with flexible libraries. A flexible library may offer an assorted collection of components but no instructions on how to put them together to solve your particular problem. The standard libraries of all programming languages fall into this category. libSolace also follows this approach. A language support library is not designed to solve any domain problem - but rather to provide implementation-level components.
In some sense, this opinionated-vs-flexible debate reminds me of Composition vs Inheritance.
libSolace does not solve any particular problem but gives you tools that you can use to build a solution. For that to happen - these tools, or components, need to be composable. That is, they need to work together.
One example of composition of components in libSolace is the design of ByteReader and ByteWriter.
These classes model a simple stateful data stream. That is a fancy way to say a byte array with an integer denoting the current position to read from / write to. ByteReader and ByteWriter operate over a memory buffer. The simplest use case is to have a static C-style array you would like to write data to:
MyData my_data = ...;
byte data[256];                       // static, stack-allocated buffer
ByteWriter writer{wrapMemory(data)};  // the writer does not own this memory
writer << my_data.x << my_data.y << message;
This snippet demonstrates how ByteWriter can be used to serialize data into a user-provided byte array. But libsolace also provides memory management facilities. So a user can allocate a memory resource dynamically and use it as a write target:
MyData my_data = ...;
MemoryResource memory = memoryManage.allocate(321).unwrap();
ByteWriter writer{memory};
writer << my_data.x << my_data.y << message;
In this example, ByteWriter is using a dynamically allocated memory buffer as the write target. It is also possible to move the memory resource into ByteWriter for it to take ownership of the resource:
ByteWriter writer{memoryManage.allocate(321).unwrap()};
writer << my_data.x << my_data.y << message;
In this case - the memory will be de-allocated when the writer goes out of scope.
These examples illustrate an important point - a library user is not required to use a memory manager or memory resources in order to use ByteReader and ByteWriter.
It may be a good idea in some cases but not in others. For example, if you need to interop with STL / legacy code:
std::vector<uint8_t> buf = ...;
ByteReader reader{wrapMemory(buf.data(), buf.size())};
This way you are free to pick and choose the components that are required for your solution.
This was just a brief overview of the philosophy and motivation for the libsolace library.
Software is everywhere around us, and we rely on it more and more each day. Thus the software must be reliable and resilient.
I plan to describe how this philosophy has shaped the design of individual components in future posts.
In the meantime, I encourage you to check out its GitHub repository.
Please let me know which library components you found useful and which ones' purpose is not clear. Also, please raise issues if you run into trouble.
Generally, when talking with colleagues with some engineering experience - there is always a consensus that errors do happen. There is, however, rarely an agreement on what an error actually is.
It might seem like an odd question to ask: “what is an error?”. Surely we all know one when we see it. And why would you have errors in your code in the first place? Shouldn’t our programs be error-free? Yeah - real-world questions here. One way to think about a program is as a sequence of actions to achieve some goal. For example: compute the result of dividing two variables; play the content of that audio file; display the pictures of my favorite cat I have saved over there; let me enter this data in a file for the record and create a report based on that data.
Notice something in common - unlike the infamous “hello world” program that takes no input and always produces the same result no matter what - the programs in these examples are designed to solve real problems (such as displaying cat pictures!). Although to be useful, they all need some flexibility. That is - it should be possible to use the same program to display dog pictures too. To do that - a program needs to take some input. After all, we may like different cats. Or music. Or have different business data. So different users have different inputs and expect different outputs - most of the time. This extra freedom - to operate on user-provided data - comes at a price. User input may be ‘invalid’ in some sense. Division by zero, a filename that does not exist, a non-existent cat species (Oh-NO!). What is a program to do in that case? Is that the program’s fault/error?
Now, it is all fun and games when you deal with simple programs designed to do one thing, operated by a single user. How about something more challenging: a service that takes input from hundreds, millions, billions of requests. How about a banking service? An airline departure control system? Any issue with user input gets that much worse.
To make matters even worse - it is not always a single request with invalid input that is immediately noticed. Remember, a sequence of actions: take money from account A (done), add money to account B. Boom! Account B does not exist. Error! And no money in account A anymore.
Surprisingly, it can get even hairier when we consider sequences across multiple systems - distributed across multiple machines in different data centers.
When researching the topic of error handling for a proper article, the best I came across was a classification of errors based on when they happen: syntax, runtime, and logic errors. This is repeated across a countless number of blogs. I found it misleading and most unhelpful. For one - if this is an ordered list - logic errors should precede compile-time errors :) You need to make a logic error first, then implement it, for it to be compiled. This whole false classification reminded me of the taxonomy of animals. But jokes aside - it is not helping anyone, because it gives no direction on how to approach errors. This is why I wanted to first review this common classification before moving on to what is actually important for any production system.
Logic errors are not computer-related per se. It may be a failure to analyze the problem domain. False/noisy data. Or a straight blunder. And let us not forget that it is possible to commit errors while writing code. There is little software can do if it is implemented with a flaw. Good processes and procedures tend to help address logic errors. RFCs, documentation and peer review are all good tools to make sure requirements are valid and logical. The more people with domain knowledge check the specification - the higher the chance of it being valid. And the more people review the implementation of that specification - the more likely it is to be valid. (Although on this note I should mention a personal anecdote - a particular change was spec’d and reviewed, the PR reviewed and approved - only to accidentally remove a good chunk of critical, but unrelated, functionality. 5 people reviewed and approved, 2 spec’d - and we still ended up with an incident in production. We had to revert the changes urgently. So it is not always about quantity, but about engaging people with DOMAIN knowledge early.)
A syntax error is an attempt to pass an ill-formed program to a compiler. This is the reason compiled languages exist - that, and performance - to prevent ill-formed programs from being executed. Modern compilers do an outstanding job at that. So much so that it is a good technique to leverage the type system such that a logic/coding error results in an ill-formed program caught by the compiler. In case a non-compiled language is used - it is still possible to make a syntax error and have your program running. Until the execution path hits the ill-formed block, that is. The advice here - have extensive unit tests. If there is a piece of code that is not exercised by a unit test - chances are there is a syntax error hiding there. Assume a program does not work until proven otherwise.
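To illustrate the 'leverage the type system' idea - here is a minimal sketch (the types and function are made up for illustration). The point is that a swapped-arguments logic error becomes an ill-formed program the compiler rejects:

#include <cstdint>
#include <iostream>

// Distinct wrapper types around the same underlying integer.
struct UserId  { std::uint64_t value; };
struct OrderId { std::uint64_t value; };

void cancelOrder(UserId user, OrderId order) {
    std::cout << "user " << user.value << " cancels order " << order.value << '\n';
}

int main() {
    UserId  user{42};
    OrderId order{1001};
    cancelOrder(user, order);    // well-formed
    // cancelOrder(order, user); // swapped arguments - caught by the compiler
}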
The category of runtime errors is the broadest one. And the most expensive. If we understand the problem and are able to implement a solution as a well-formed program - we are free of the first two ‘classes’ of errors. But it is the runtime errors in your system that prevent users from using your software, that customers complain about on forums. It is the type of error that wakes you up at night - when your service is down. This is the type of interest to me here.
There is also an important dimension of this classification that is not always mentioned: the frequency and severity of the error.
All the above classification is of little help when it comes to handling an actual error at runtime. So what can we do? In order to identify what actions can be taken when an error occurs, let’s consider different types of applications. There are ‘hello world’ apps, and there are real-world applications that take some input and produce some output. The input may be provided indirectly (command line, config file, script) or directly - read from a keyboard, mouse, gamepad. It can come from a network. Or - to generalize a bit - from a sensor: camera, keyboard, network device, IoT-type sensor. Real-world applications tend to produce output that depends on the input: responses to network messages, a displayed picture, a converted file - etc. This gives us an interesting concept to work with: a valid input should produce valid output. If that is not the case - the program itself is incorrect. There is a bug in the application that must be fixed. The focus of this article is to explore the failures of a valid program, because using QA techniques such as testing - it should be possible to determine whether a program is valid. That is, given a valid input sample, it should be possible to establish whether it results in a valid output. That does not guarantee, of course, that we have no bugs and that no possible input will uncover them. That is why it’s important to have a representative sample of inputs to validate the application.
Failure to obtain an input (valid or not) thus must result in a failure to produce valid output. Of course, it is also possible to fail to ‘store’ the output even given valid input. That is - the display has been unplugged just when that cat picture was about to render. Or more realistically - the network drive crashed while the video converter was writing the output. The network dropped in the middle of an upload. It is also possible to fail to produce output given valid input and a valid output device - if the process failed to secure the intermediary resources required to process the input. Out of memory to store the input, etc. What makes this one particularly annoying is that the resources required for processing cannot always be predicted, as they depend on the input itself. For example, if you want to view a raw uncompressed 8K picture of a cat - you might need a lot of RAM to fit the file in. Or a special app that can handle mem-mapped files.
So, plenty of ways to fail. On the upside - understanding where failure can happen is critical to deciding what to do about it.
Now that we know where valid, bug-free programs fail - what options do we have? It turns out “it depends on the app”. It is one thing when a picture viewer app can’t find a file because of a typo in the name. The app can quit and restart. It is a different case when a server receives a bad request. It is not a good idea to shut down the whole service for other users. What about critical systems where failure is not an option? Let us explore the types of applications to see if we can generalize something.
The way I have come to think about apps is - an app is either a ‘regular’ application or it is a ‘service’. For the purpose of error handling, we will have to (re)define what that actually means in terms of IO.
Regular means applications that have all the data they need to perform an action - whatever it is - readily available. These are ‘simple’ Unix-philosophy apps: ls /, rm -rf /, magick convert rose.jpg rose.zng, etc.
We can observe that these applications have their input given at the start. It could have been a config file or a script. The point here is that the input is specified. The only problem is - it can be invalid: zng, for example, is an invalid file format, and the file rose.jpg does not exist in the current directory. The output is also defined: store the results in a file. It is possible you don’t have write permissions. Or disk space. In any case - an error here is the failure to acquire input or output resources. And the way to resolve it is to correct the inputs. The only interesting twist is that if you specify the output device as part of your input - an output problem is really an input error. Confusing :)
An example of an output error is when you print something and the printer jams in the middle of the print. (Note to self: does anybody still print things?) Ok, maybe - netcat <remote_ip> and the WiFi drops in the middle of transmission. The point is - output can fail too.
So what do we do in this case? We try again. That is to say, the whole process can be terminated at the point of error. Important note: all the inputs are immediately available, so the process can be restarted. In most cases it is also reasonable to assume that no output has been persisted, as the error prevented the app from producing one.
(Yeah, yeah, I know - netcat can read from stdin, or from a pipe with ephemeral results - this was just to illustrate the point.)
It is a different story when a process is responsible for the acquisition of data and its transformation. Single-player games come to mind as the example most users may be familiar with. A player smashes the controls and some action takes place on the screen. Although in this case, we don’t expect a user to press an unsupported button.
A service or a job-queue worker is a different example of the same concept - the input is acquired from some source and is not known beforehand. In this case, the input itself can be an invalid message. But it should not result in an interruption of service.
Indeed, it would be a pretty bad service if one user-generated message brought the service down for everyone.
Thus, in the model of message/transaction processing, it is possible and beneficial to isolate failures. One can think about isolation as each message/transaction being processed by a separate process. If the input is invalid - the good old fail-with-an-error-message strategy works AND it does not affect other transactions. In fact, this model has been implemented in inetd and in early versions of the Apache web server - where each request was handled by a new process. The downside is that spawning a new process for each request takes too long, and each process has memory overhead. An improvement on that approach is the pre-fork model - where a number of processes are created beforehand in standby mode. When a request is received - it is handed off to a free worker process. The number of worker processes can be monitored, and when one crashes due to bugs or input issues - a new one is spawned in its place.
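Here is a rough sketch of the pre-fork model, for illustration only (POSIX, error handling omitted): N workers are forked up front and all block in accept() on a shared listening socket, so the kernel hands each incoming connection to one free worker:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(listener, 16);

    for (int i = 0; i < 4; ++i) {              // pre-fork 4 workers
        if (fork() == 0) {                     // child: becomes a worker
            for (;;) {
                int client = accept(listener, nullptr, nullptr);
                // ... handle one request; a crash here kills only
                // this worker, not the whole server ...
                close(client);
            }
        }
    }
    // Parent: a real server would wait() for dead workers here
    // and fork replacements to keep the pool at full strength.
    for (;;) pause();
}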
To fight the memory overhead of processes - some servers chose to handle requests in threads. In fact, most modern web servers are thread-based.
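For contrast - a thread-per-request sketch (again purely illustrative; the listening socket is set up as in the pre-fork example above). Spawning a thread is much cheaper than spawning a process, but a crash in one handler now takes down the whole server process, so the isolation is weaker:

#include <sys/socket.h>
#include <unistd.h>
#include <thread>

// Same accept loop, but each request is handled by a detached thread.
void serve(int listener) {
    for (;;) {
        int client = accept(listener, nullptr, nullptr);
        std::thread{[client] {
            // ... handle one request ...
            close(client);
        }}.detach();
    }
}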
In this post, I tried to untangle the mess of error handling. Going beyond simple bugs in the implementation, errors can occur in the course of normal operation. Failures DO HAPPEN, and an application must have a strategy to deal with them. Crashing the app may be a perfectly valid strategy for simple apps - in case the app does not deal with input acquisition itself but gets all it needs from the start. User-facing apps can report: “Invalid input - correct and restart”. An interactive app has the option to maintain a dialogue: you don’t want to crash the whole app, after walking a user through a series of actions, only to start again - maybe just redo this one input. Services take this a step further - they have multiple independent inputs, and in case one is invalid - it should not prevent the others from being processed. This was more of an overview of what is out there. I wonder if I missed some important aspect of error handling here?
libstyxe is a somewhat simple library - its only job is to parse 9P messages out of user-provided bytes. And to write such messages into a byte stream. In this case, the error handling strategy is to report the error back to the caller, informing them of the invalid input. That is, if the library itself has no implementation issues - which unit testing should help find. So let us take a closer look at the implementation details of error reporting.
While refactoring libstyxe, I also realized that the way I had implemented encoders and decoders does not lend itself to reuse or extensibility.
Having member functions such as Encoder::encode() and Decoder::decode() overloaded for value types is easy to develop. All of the overloads have the same return type of Solace::Result<>. However, if I were to add a new data type (for example, to support 9P2000.L) - I would need to ‘extend’ the encoders and decoders.
Since designing libstyxe - I didn’t want to use inheritance to extend classes - I have been thinking about how else it can be done in a way that preserves error handling.
The goal: find a way for Encoders and Decoders to be extensible and reusable while keeping error reporting in the familiar form of:
return encoder.encode(value1)
.then([&]() { return encoder.encode(value2); });
That is to say, the Encoder can take different types of values to encode. And encode operations are chainable, such that if one operation fails, the following ones are not performed.
Oh, and do all that without using inheritance!
The initial implementations of styxe::Encoder and styxe::Decoder were designed to return a Solace::Result for each operation:
styxe::Encoder encoder{...};
auto result = encoder.encode(my_value);
if (!result)
return result.getError();
result = encoder.encode(my_other_value);
if (!result)
return result.getError();
This is all fine and familiar, but my personal experience suggests that error handling is not optional (all puns intended). That is, if error handling can be ignored - chances are it will be ignored. Most likely exactly when it is needed. In code terms, it means there is nothing stopping someone from writing code like this:
encoder.encode(my_value);
encoder.encode(my_other_value);
It is obviously shorter to read. It also has a less obvious issue: if the first encode operation fails, the second should not be called.
That is to say, we want robust error handling. Throwing an exception to signal an error is an option. Except it does not make it any clearer what error is expected. One would also need to decide whether it is truly an exceptional situation.
But more importantly, the problem of error handling remains. It is not enough to throw an error - the error must be handled in a meaningful way.
While working on the support library libsolace, I decided to experiment with a different approach to error handling.
That is, to return a Result<> type that can hold either a value or an error. This is the approach implemented by Rust, for example.
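To show the shape of the idea - here is a toy Result built on std::variant. The real Solace::Result is richer; this is only an illustration:

#include <iostream>
#include <string>
#include <variant>

// Toy Result: holds either a value (index 0) or an error (index 1).
template <typename V, typename E>
struct Result {
    std::variant<V, E> state;

    bool isOk() const { return state.index() == 0; }
    V& unwrap() { return std::get<0>(state); }
    E& getError() { return std::get<1>(state); }
};

Result<int, std::string> parsePositive(int raw) {
    if (raw <= 0) return {std::string{"not positive"}};
    return {raw};
}

int main() {
    auto r = parsePositive(-5);
    if (!r.isOk())
        std::cout << "error: " << r.getError() << '\n';
    else
        std::cout << "value: " << r.unwrap() << '\n';
}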
Various error handling techniques have been discussed many times. My favorite take on the topic is using monads.
For the purpose of the libstyxe design, while the discussion is still raging on, I wanted something easy to use and reusable. Maybe even composable?
Wouldn’t it be nice to write:
Result<void, Error> writeSomething(...) {
    ...
    return encoder << my_value
                   << my_other_value;
}
This operation chaining does look familiar to C++ developers. What is missing is the conversion to a Result<>.
As another experiment, I chose to add a different set of operator<< and operator>> overloads that take a Result as an argument, in addition to the classical ones:
Solace::Result<Decoder&, Error>
operator>> (Solace::Result<Decoder&, Error>&& decoder, Solace::uint64& dest);
This allows me to have operation chains and to return a result. Exactly like the target solution.
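Usage then looks roughly like this (a sketch - assuming the plain overload taking a Decoder& also returns Solace::Result<Decoder&, Error>, so that the first operation starts the chain):

Solace::uint64 size;
Solace::uint64 offset;

auto result = decoder >> size    // Decoder& in, Result<Decoder&, Error> out
                      >> offset; // the Result&&-taking overload from above
if (!result)
    return result.getError();

If reading size fails, the Result&&-taking overload sees an error-state Result and skips the read of offset - the error simply propagates to the end of the chain.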
The only challenge was adding support to the libsolace Result for reference types.
Glad that one is done, so now Result<Value&, Error> is a valid type.
For a message parser and writer, the best error handling strategy seems to be returning the error to the caller. Save for coding issues - there should be no place for panic. That is, no need to throw exceptions. After all, if all your function does is convert user input to a value - it is reasonable to expect that the user-provided input can be invalid. The only challenge is keeping a nice interface that allows operation chaining. The good old free-standing operator<< and operator>> fit the bill perfectly, if these operators are allowed to return a Result. That is - each IO operation can fail.
This requires a set of overloads that accept Result<Encoder&, E> and Result<Decoder&, E>.
This may not be the most elegant solution, but it is extensible - at the cost of an extra overload for each new type conversion added.
I will write another post with the results of using this approach after a while.
Quite some time ago, I developed a habit of taking notes whenever I felt I had an exciting idea. Which, admittedly, is not a very frequent occurrence. It turns out note-taking is a good exercise - kind of like swapping memory pages into long-term storage. It frees your working memory.
It is also interesting to revisit notes from time to time: “What was I thinking back then?” or “Hey - this is not a bad idea at all!”. There are, however, downsides to taking notes - they do get lost. And it is never a good thing to lose good ideas. Even worse - no one else can see these ideas. Shocking, I know.
If only there were a way to store notes such that they were easy to share with the world. Maybe throw in version control too.
Oh! Of course - there is a technology for that :) So - here we are. Version controlled persistent and shareable memory dump, also known as a ‘blog’.
In this blog, I want to focus on Software Engineering in general and modern C++ in particular. With occasional musings about life in general.