AVL Tree, doubly linked lists and iterations
Over the past few months I’ve been mulling over a Network Performance and Intrusion Detection box that could sit on a network and make route suggestions to the border.
There are some very basic CompSci fundamental processes involved in the data analysis to find the source and preferred path for each incoming packet. When you’re dealing with flows and traffic capture of multiple gigabits per second, you need something that is very quick.
The problem we have is the representation of an IP address in memory. Obviously we can use a tuple to store the octets or words, but, the Internet isn’t fully allocated and there are a lot of empty netblocks.
Since the data is sparse, and we need to know whether a packet is in a netblock that we’ve seen, the ASN, and the preferred path we have a small problem.
Our first issue comes from the fact that we might see an announcement of:
which contains all of the IP addresses from 184.108.40.206 to 220.127.116.11. So, large block announcements (and overlapping announcements) become a potential problem. Currently, there is an announcement for 18.104.22.168/8 and 22.214.171.124/9. Any IP address in 126.96.36.199/9 (over 8.3 million) is handled by one announcement but also contained within 188.8.131.52/8.
With this in mind, we’ve eliminated the ability to do a binary search as our list would be quite imbalanced. We could store the individual tuples in a tree which would be three levels deep, but, there are portions of the internet filled with /24s that would be quite dense. Since those portions of netspace are not only densely packed, but quite active, we know we’re going to be spending a lot of time traversing those nodes.
Back to the issue at hand, we have an IP address, 184.108.40.206/32 that we need to find the source announcement for. Assuming for the moment we ignore overlapping address announcements, we need to find that the IP address is announced in 220.127.116.11/18, originates from ASN306 and which of our current peers is the ‘best path’ according to BGP4.
BGP4 deals with routes based on shortest path. Each hop through a provider adds an ASN to the path so that we know how to get to a destination. While the shortest ASPATH seems logical that it would be the fastest path, that isn’t always the case. There are times when a packet might traverse a provider’s backbone that has some congestion where avoiding that provider could result in faster connection/transfer speeds for that destination. This happens much more frequently than you would think where providers dump traffic on free exchanges rather than paying for the transit and running those connections at 95% capacity during peak hours. Performance analysis is a separate issue, we just need to find the most specific announcement for a particular destination.
There are several data structures that come to mind:
* Sorted List – iterate through using a number of different methods sequentially, binary search, etc.
* Red/Black tree – Faster than binary, but, route announcements can be withdrawn and we might have some deletions leading to a more imbalanced tree.
* AVL tree – Faster than red/black, more evenly balanced, great for lookups, slight performance penalty on deletions, but, we don’t delete too many items too often.
So, the AVL tree probably wins in this case, but, we’re not comparing for equality. We need to find the IP address within in announcement, and while memory is cheap storing every single IP address and an associative pointer to the source announcement would be the fastest if we used a simple hash lookup, we need to modify our AVL tree lookup slightly. With IPv4, if we used only the first three octets, we would only have 16.4 million entries if we represented each netblock with an associative pointer to the source. With IPv6, our same hash would need 1.84×10^19th entries. In either case, populating that full hash would take a bit of time.
We need to find the node of the tree that has a value greater than our current IP address, then, take the previous node and verify that the IP address is within that announcement. It should be if our routing table is complete, but, if not, it could be a ‘martian’ or an announcement to a location that would be returned to our default route.
Since we’ve chosen Erlang, it does have a Generally Balanced Tree built in which has some decent performance improvements over a generic AVL tree, but we need to modify the traverser slightly for our use case. Luckily we can do this.
I raise this issue because I read a blog post about a student claiming he’d never put anything he learned in college to use. I think he’s just not solving the right problems. I learned about trees, sorts, linked lists in college and still put things like this to use today.
I remember back in the mid ’80s I was attending UMBC to get a degree in Computer Science. Since I had computer programming experience, I was able to take two experimental courses to replace the general CompSci track – introducing me to Ada by one of the few experts at the time. However, a friend that was taking CS270 – Pascal mentioned his teacher was giving a discussion on doubly linked lists and a new method of traversal he had come up with. As I lived on campus and my advisors had completely screwed my class schedules, I had a lot of free time between classes and opted to attend the class. Of course, the teacher decided to give a pop quiz that day prior to the lecture which I passed over to my neighbor. The TA asked why I wasn’t taking the quiz and I said, I’m not in the class and was here to hear the lecture. This raised a small commotion which brought the professor over to ask why I refused to take the quiz. I explained that I had heard through a friend he was giving a lecture on doubly linked lists with a different method of traversal and I was interested in hearing the discussion – I wasn’t a member of the class and if he preferred, I would leave, but, I wanted to hear the lecture.
I think something happened – this teacher had a reputation for being very intelligent but very gruff. He asked me to come with him to the front of the class and said,
Ladies and Gentlemen, this is Chris Davies. He doesn’t attend this class, but, wanted to hear my lecture. Here’s a student that actually wanted to be here to learn while you’re all disappointed with having to show up to hear me talk.
He gave his lecture and we did talk a few more times after that about data retrieval and data structures.
Yes, twenty-six years later, I am still using the education I paid for.