Distributed Information in the Modern Era

Background

I work as an engineer in the “high performance messaging middleware” space. Most people don’t know what this means, and you would think after doing it for six years I would have perfected a convenient explanation by now. Alas, I have not.

Messaging middleware is software that other software developers use to enable “quality” communication between applications. What counts as quality varies significantly by use case: sometimes applications need to send lots of data and speed isn’t very important. In other use cases the speed, or latency, of the messages is the most important aspect, even more important than the messages themselves (that might be hard to grasp, but since it is not the point of this article I will not bother explaining it right now).

If we wanted to depict what messaging is in its simplest form, it looks like this:

Sources send messages, receivers receive the messages, and the space between the source and the receiver is the send path. Inside the message is a “payload” which is the content of the message the receiver needs, either for informational purposes or possibly an instruction to perform an operation.

Sometimes messaging systems are brokered. A broker is a piece of software in the send path between a source and receiver. It has advantages like being able to “fan-out” the data (sending it to multiple receivers), “guarantee” the messages by storing them to a redundant storage mechanism, and in some cases the broker can even transform or modify the payload based on a set of rules. A brokered system looks like this:
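In code, a toy version of such a broker might look something like the sketch below. It is only an illustration: the `Broker` class, its method names, and the in-memory `stored` list (standing in for redundant storage) are made up for this article and do not come from any real messaging product.

```python
class Broker:
    """Hypothetical in-process broker: stores, optionally transforms, then fans out."""

    def __init__(self, transform=None):
        self.receivers = []         # callables invoked once per published message
        self.stored = []            # stand-in for a redundant storage mechanism
        self.transform = transform  # optional rule applied to every payload

    def subscribe(self, receiver):
        self.receivers.append(receiver)

    def publish(self, payload):
        self.stored.append(payload)  # keep the original payload first
        outgoing = self.transform(payload) if self.transform else payload
        for receiver in self.receivers:  # fan-out: every receiver gets a copy
            receiver(outgoing)


inbox_a, inbox_b = [], []
broker = Broker(transform=str.upper)
broker.subscribe(inbox_a.append)
broker.subscribe(inbox_b.append)
broker.publish("price update")
print(inbox_a, inbox_b, broker.stored)
```

Both inboxes end up with the transformed payload, while the broker keeps the original. The receivers never see what the source actually sent.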

Seems simple, and at a high level it is quite simple. However, real world implementations are almost never simple. Messaging systems exist in complicated computer networks, and it is often taken for granted that the system will deliver a consistent quality of service. For example, a lot of send paths look like this:

And this is still abstracting a lot of the ugly stuff. Most networks have many layers of switches, and routing is usually handled by big beefy servers. And firewalls are like the TSA of networks: they slow everything down, good and bad messages alike. And to complicate things even more, there are multiple “protocols” over which messages are relayed, things like TCP and UDP. Protocols are like different languages for the computers and networking equipment in the system. There are standards for how these protocols are supposed to be implemented, but depending on the actual hardware/software, the implementation can still vary.

Oh and the Internet, let’s not even go there. The cloud that makes up the Internet is understood by very few people, maybe no one in its entirety.  At a very high level, it’s a lot more switches and routers, maybe a dozen or maybe a hundred from point A to point B.

Point is, networks are complicated.

Moving back to messaging, messaging itself is not without its complications. For example, messages can get “lost”. Let’s say a source publishes three messages and the receiver only gets two:

This happens. Often. Like so often that if it did not happen I probably would not have a job. Why did it get lost? It could be a lot of different reasons: a disconnect in the network somewhere, the message being dropped intentionally by some piece of networking equipment because of higher priority traffic, solar radiation changing bits on a copper wire (this really happens), or a bug in a switch’s software that drops all UDP packets with a 1 in the 17th byte (this not only can happen, I’ve seen it happen). Most commonly the explanation is much simpler. More often than not, a message gets lost simply because the receiver can only receive so many messages in a given period of time; let’s call this X messages per second. If a source tries to send X+1 messages in a second, the receiver simply fails to receive the +1.
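To make the X+1 case concrete, here is a deliberately oversimplified sketch. The `Receiver` class and its `capacity` parameter are hypothetical, and real receivers lose data in messier ways (full socket buffers, for example), but the arithmetic is the same.

```python
class Receiver:
    """Hypothetical receiver that can accept at most `capacity` messages per second."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.received = []
        self.dropped = 0

    def deliver_burst(self, messages):
        """Take one second's worth of traffic; anything beyond capacity is lost."""
        accepted = messages[:self.capacity]
        self.received.extend(accepted)
        self.dropped += len(messages) - len(accepted)


X = 1000
receiver = Receiver(capacity=X)
burst = [f"msg-{i}" for i in range(X + 1)]  # the source sends X+1 in one second
receiver.deliver_burst(burst)
print(len(receiver.received), receiver.dropped)  # 1000 1
```

Nothing in the receiver is broken here. It did exactly what it was able to do, and one message is still gone.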

Messaging software also has some fancy mechanisms built in, things like “filtering”. Filtering can happen at the source or receiver, and it basically means that messages are intentionally discarded based on some kind of pre-configured criteria:
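As a rough sketch of what receive-side filtering amounts to, assume each message is a small dictionary with a `topic` field. That shape, and the `make_topic_filter` helper, are assumptions made for illustration only.

```python
def make_topic_filter(allowed_topics):
    """Build a predicate from pre-configured criteria (here, a set of topics)."""
    allowed = set(allowed_topics)

    def keep(message):
        return message["topic"] in allowed

    return keep


messages = [
    {"topic": "prices", "payload": "tick 1"},
    {"topic": "weather", "payload": "rain"},
    {"topic": "prices", "payload": "tick 2"},
]

keep = make_topic_filter(["prices"])
delivered = [m for m in messages if keep(m)]
discarded = [m for m in messages if not keep(m)]
print(len(delivered), len(discarded))  # 2 1
```

The application only ever sees `delivered`. As far as it is concerned, the weather message never existed.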

Messaging systems also have configurations for making sure only certain receivers get a “special” message, so the source does not waste time sending messages to receivers it thinks wouldn’t be interested in them:

And finally we have brokers, which can not only lose messages and route special messages but can also modify messages based on specific configurations, changing the data before the end receiver gets the message:

Messy stuff. At a high level, this is the complexity of messaging in distributed systems. Many sources are sending many messages, sometimes directly to receivers and other times indirectly via brokers to many receivers, all the while dealing with the inter-tangledness of large networking infrastructures that speak in different protocols and different implementations of the same protocols, while messages are explicitly and implicitly being filtered at every node along the send path. It’s safe to say no single node on any of these systems has a complete and honest view of the rest of the system, and that is largely by design, because a single node would not be able to process the vast amounts of data that typically flow on large networks.

The Problem

All right, here comes the metaphor: our world of news and information sharing works largely in the same way as distributed systems, just on a much larger and less reliable scale. We have our sources and receivers:

And we have our brokers:

The same problems we have with distributed messaging we have with our information gathering. Sources (governments) filter their messages when delivering them to both end receivers (us) and brokers (press), and brokers transform messages to appeal to a specific subset of receivers, while other nodes aren’t even listening:

The source-to-broker-to-receiver chain seems to get longer each year as information publishing gets more economical. Most people used to get their news from a direct broker of information, and even though the messages were filtered, the information was at least mostly accurate. Nowadays, people are subscribing to brokers that get their data from other brokers, further modifying messages from the original:

That’s not to say that the brokered-brokers are always factually inaccurate (sometimes they are), but their messages are heavily modified. This is not inherently bad; the big problems exist at the receivers themselves. Us. We have receive-side filtering that will discard data based on emotional responses and subconscious biases, particularly if there is outright distrust of the source:

We receive our news and our talking points from more and more filtered sources and brokers, to the point that we’re building our knowledge of the system from fragmented and incomplete data. Then to fragment the data even more, some receivers get information from other receivers that have an incomplete view of the system:

If you think back to the networking example, receivers cannot receive all the data in the system because there is just too much data. Our world is no different; in fact it is even worse. Each source sends messages on a specific topic, and the more topics we subscribe to, the more messages we receive, and the more prone to loss we are. It’s like trying to receive 10 gigabits of data on a 1 gigabit link. It’s just not physically possible:

This is obviously an over-generalization of the world today. Some people dedicate more time to gathering and processing information and therefore can construct more valuable insights than others, but that takes patience and time. Some people are more honest with themselves, can recognize their biases, and try to broaden their spectrum of sources despite the strong emotional urge to ignore some of them. It’s not easy though. It requires work, it requires a certain amount of dedication, and it requires the emotional restraint to not jump to the conclusions some sources are pushing you to make.

We all see the noise on social media, the punditry masked as news put in our faces 24/7 by friends and family. Some of us are even guilty of sharing the noise without considering that other sources may exist, or bothering to research whether they do. It’s easy to be cynical, to lose trust in the system because of this. It’s important to realize that each node, each source, each broker, and each receiver is independent of the others. Yes, governments may try to bend messages in a specific way to shape a narrative. That’s actually why we have multiple independent brokers (press) with access to stay close and keep the sources honest. It’s in their interest to sniff out the noise because the broker that uncovers the truth gets to break the story. And yes, there may be collusion between sources and brokers, but that’s why it’s important to have a broad field of independent brokers to uncover any injustice in the system. No system is perfect.

A more accurate depiction of our distributed network of information publishing probably looks more like this:


We are all interconnected in one way or another, yet we maintain a deafness to the majority of information. We have internal filtering mechanisms, and if you compound that with broker-based news publishing along with broker filtering and transforming, it is no wonder so many nodes on our network are misinformed. And with so many nodes getting data from misinformed nodes, the issue cascades to the point where some nodes do not trust any of the data from brokers or original sources.

The Solution

Let’s circle back to the messaging analogy for a moment. Like I said, I spend a lot of time helping users diagnose and fix lost messages in a distributed messaging system. Many times users do not even know it’s a loss problem; they just know they have applications generating bad data and do not know why. To fix any messaging problem, one must address the loss first and foremost, and there are generally two simple solutions:

  1. Slow down the sources / subscribe to fewer sources
  2. Speed up message processing by moving unprocessed messages to a queue to be processed later.
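The second option can be sketched in a few lines. This is a simplified model, not how any particular product does it: the `process_with_queue` function and the idea of fixed "ticks" are assumptions made purely to illustrate the trade. Nothing is lost, but everything arrives later.

```python
from collections import deque


def process_with_queue(bursts, capacity_per_tick):
    """Queue whatever cannot be handled now and work through the backlog later."""
    if capacity_per_tick <= 0:
        raise ValueError("capacity_per_tick must be positive")
    incoming = list(bursts)
    backlog = deque()
    processed = []
    ticks = 0
    while incoming or backlog:
        if incoming:
            backlog.extend(incoming.pop(0))  # this tick's arrivals join the queue
        for _ in range(min(capacity_per_tick, len(backlog))):
            processed.append(backlog.popleft())  # handle only what we can
        ticks += 1
    return processed, ticks


# Five messages arrive at once, but the receiver handles only two per tick.
processed, ticks = process_with_queue([[1, 2, 3, 4, 5], [6], []], capacity_per_tick=2)
print(processed, ticks)  # [1, 2, 3, 4, 5, 6] 3
```

Every message is eventually processed, in order. The price is latency and memory for the backlog, and if the source never slows down the backlog grows without bound, which is why the first option still matters.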

The same principles apply to information gathering for us. We cannot possibly keep up with all the news and information being published on a daily basis; it is just not physically possible. Therefore we are oftentimes left with incomplete information, because we get broken-up information from multiple sources or we get filtered data from a single source.

If a topic interests you, rather than just subscribing to a single broker, or a particular subset of brokers, subscribe to all of them on that particular topic. That is still going to be a lot of data; however, if you queue it and take your time processing it, you will be a better informed node in the system. Do not stop at the most recent messages on the topic either. Go back and retrieve the history of that topic. The more data you can gather and process, and then share with your surrounding nodes, the better off we’ll all be. Imagine a network where every node was well informed on a specific topic and only shared what it knew about that topic.

Obviously we do not live in that world. Instead we live in a world where few of us are experts on the topics we hold strong opinions about. There is nothing wrong with that as long as we understand the fact that our viewpoint is incomplete and lacking data. The smartest thing we can say in this case is “I don’t know.” Too few of us know that we do not know.

It’s a lot to ask, to maintain due diligence before passing judgement on a talking point that people are passionate about. We have busy lives with family, jobs, social obligations, and even trying to maintain our own mental and physical health; it’s hard to find the time to be open-minded. However, if we do, if we take the time to consider alternatives, to research, to not share mindless information that intentionally provokes, we would all be better off.
