With a clear definition of what to collect and who the user is, the question of how to collect data records becomes relevant. Common terms are meter and metering. The term meter describes a measuring process, even though a more precise definition is required for accounting purposes. The definition of meter used in this book describes the measurement function in the network element or in a dedicated measurement device. Metering is the process of collecting and optionally preprocessing usage data records at devices in the network. These devices can be either network elements with integrated metering functionality or a dedicated measurement device ("black box") that is specifically designed as a meter.
The following details need to be considered for metering:
Meter placement, at the device interface or the central processor
Unidirectional or bidirectional collection
Collection accuracy
Granularity, which means aggregating packets into flows or aggregating multiple meters into a single value
Collection algorithm, which means inspecting every packet with a full collection, or only some packets with sampling
Inspecting the packet content for selection with filtering
Adding details to the collected data sets, such as time stamps and checksums
Export details, such as protocols, frequency, compression, and security
You can distinguish between two major monitoring concepts:
Passive monitoring— Also referred to as "collecting observed traffic," this form of monitoring does not affect the user traffic, because it listens to only the packets that pass the meter. Examples of passive monitoring functions are SNMP, RMON, Application Response Time (ART) MIB, packet-capturing devices (sniffer), and Cisco NetFlow Services.
Active monitoring— Introduces the concept of generating synthetic traffic, which is performed by a meter that consists of two instances. The first part creates monitoring traffic, and the second part collects these packets on arrival and measures them.
Note that both instances can be implemented in the same device or at two different devices.
A simple example of an active test is to set up a phone call to check if the destination's phone bell is operational. The Cisco IP SLA feature is an instantiation of active measurement. The main argument for passive monitoring is the bias-free measurement, while active monitoring always influences the measurement results. On the other hand, active measurements are easily implemented, whereas some passive measurements, such as the ART MIB, increase the implementation complexity. Table 2-16 summarizes the pros and cons of both approaches. Best practice suggests combining active and passive measurements to get the best of both worlds.
Passive monitoring concepts are categorized into two groups:
Full collection— Accounts all packets and performs various operations afterwards.
Partial collection— Applies sampling or filtering to select only some packets for inspection.
In both cases, you can store either packets or flows, which leads to the definition of the two terms. Packets refer to individual instances, without identifying a relationship between them. Flows consist of packets related to each other—for example, because they belong to an exclusive data session between a client and a server.
A major distinguisher between different passive monitoring techniques is the unidirectional versus bidirectional type of collection. Unidirectional concepts, such as Cisco NetFlow, collect traffic in one direction only, resulting in multiple collection records (for example: record 1 is source → destination; record 2 is destination → source). These have to be consolidated afterwards. Other technologies, such as RMON and the ART MIB, measure traffic in both directions and directly aggregate the results at the meter. At first glance, the bidirectional method seems more practical. Unfortunately, it cannot be applied in networks with asymmetric routing (which means that the return traffic takes a different route) or load sharing across multiple devices.
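The consolidation of unidirectional records can be sketched in a few lines. This is a minimal illustration, not how any particular collector works: the record layout (source, destination, packets) and the function name `consolidate` are assumptions made for the example, and "forward" simply means the direction in which the addresses happen to sort first.

```python
from collections import defaultdict

def consolidate(records):
    """Merge unidirectional (src, dst, packets) records into per-session totals."""
    sessions = defaultdict(lambda: {"forward": 0, "reverse": 0})
    for src, dst, packets in records:
        # The sorted address pair is a direction-independent session key.
        key = tuple(sorted((src, dst)))
        direction = "forward" if (src, dst) == key else "reverse"
        sessions[key][direction] += packets
    return dict(sessions)

records = [
    ("10.1.1.1", "10.2.2.2", 120),  # record 1: source -> destination
    ("10.2.2.2", "10.1.1.1", 95),   # record 2: destination -> source
]
result = consolidate(records)
print(result)
```

With asymmetric routing, the two records would be exported by different meters, so a merge of this kind has to run at the collection server rather than at the meter.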
Full collection processes every packet and therefore guarantees that the output of the metering process is always exactly equal to the data passing the network element. Accuracy is the main advantage of a full collection. Major disadvantages are the large number of generated data records and the performance overhead caused by the collection. Full collection concepts are implemented for packet collection as well as flow collection. Various collection technologies have different performance impacts. For example, updating the device's SNMP interface counters consumes fewer resources than collecting NetFlow records. A clear distinguisher between various collection methods is the ability to differentiate applications in the data records. For example, SNMP interface counters collect only the total number of packets. They do not identify the relationship between packets that belong to the same type of application; identifying that relationship is called stateful collection. Stateful collection identifies the associations between packets that belong to the same session (such as an FTP file transfer) and does it bidirectionally: from source to destination and destination to source. In the case of TCP sessions, NetFlow implements a partly stateful flow collection, because flows are identified by start (SYN) and stop (FIN or RST) flags. It is not completely stateful, because no bidirectional correlation exists. Another full collection technique is the ART MIB, which is an extension of RMON and proposes a transactional method. Instead of the stateless collecting approach taken by RMON, ART identifies the start of a session, creates an entry for this transaction, and monitors the network for the associated return packet. The elapsed time between the initial packet and the response is measured and stored in the MIB. ART can identify all TCP applications as well as two protocols on top of UDP: NFS and SNMP.
The increasing speed of interface technologies forced the development of alternative technologies to a full collection—for example, filtering and sampling. Today, Fast Ethernet is the default interface speed for a PC, whereas workgroup switches have multigigabit uplinks and optical WAN links that drastically increase transmission capabilities. To avoid CPU and memory resource exhaustion in the network elements and to avoid overloading the collection infrastructure, new sampling concepts are required. In the future, full collection methods such as NetFlow will not be scalable for high classification techniques; instead, they will require dedicated devices. For network elements, the proposed solution is to focus on sampling techniques, which is the reason for the in-depth analysis of sampling techniques in this book. Using sampling techniques for billing introduces a paradigm change, compared to the legacy world of SS7. Instead of applying "Don't forward traffic if you can't bill it," the new paradigm can be described as "First, forward traffic as fast as possible and apply billing as the second instance."
The definition of sampling used in this book is as follows:
Sampling is the technology of selecting a subset (the samples) of the total data set in the network, typically for accounting purposes, with the aim that the subset reflects most accurately the characteristics of the original traffic.
A useful analogy is a puzzle. How many pieces do you need until you can identify the full picture that the puzzle depicts? You have probably done this, so you know that not all pieces are required. Sampling in this case means assembling just enough puzzle pieces to envision the big picture. Therefore, the idea is to select only "important" packets and ignore "unimportant" packets. If you want to relate this to the puzzle analogy, you need to collect only those puzzle pieces that shape the object enough so that you can recognize the full picture.
An alternative technique to sampling is filtering, which applies deterministic operations based on the packet content. Whereas sampling can depend on the packet's position in time, space, or a random function, filtering is always a deterministic operation.
The definition of filtering used in this book is as follows:
Filtering is a deterministic operation that differentiates packets with specific characteristics from packets without these properties.
To follow the puzzle example, you apply a filter when you select all border pieces first. The filter criterion in this case would be "Select pieces only if one side has a straight line." After building the frame, you would probably select significant middle pieces with contrast and pictures, not pieces that are completely blue, such as those from the sky or the ocean. Figure 2-7 demonstrates the results of sampling and filtering.
Two terms are commonly used in the area of sampling and filtering:
Parent population describes the original data set from which samples are taken.
Child population describes the remaining data set after sampling, which is the sample.
The objective of sampling is to have the child population represent the parent population characteristics as exactly as possible; otherwise, the collection is biased and most likely of less use.
Shifting the focus back to the networking environment, it is advantageous to leverage sampling, especially on high-speed interfaces of the networking devices. The sampling process selects a subset of packets by either applying deterministic functions to the packet position in time or space or applying a random process. Deterministic sampling selects every nth packet or packets every n seconds, for example. An example of random sampling is selecting one out of 100 packets based on a random algorithm or a packet every 5 ms. Figure 2-8 illustrates random and deterministic packet sampling.
Sampling compared to a full collection provides advantages for the metering device, the network that transports the data sets, and the application that processes the data afterwards:
Meter— Sampling increases the scalability to collect traffic at high-speed interfaces. Processing all packets becomes increasingly difficult.
Transport network— Sampling can reduce the data export from the meter to the collection server.
Application— A smaller data set reduces the required processing power at the mediation and application server.
After deciding to sample traffic, consider the sampling rate and the required accuracy, which is also called the confidence interval. If the sampling rate is too low (which means undersampling), the child population is less accurate than required by the confidence interval and does not correctly represent the parent population traffic received at the device. If the sampling rate is too high (oversampling), you consume more resources than necessary to get the required accuracy. For a better understanding of the different sampling techniques, the following structure applies in this chapter.
A relevant concept is the sampling strategy, such as deterministic (or systematic) and random (or pseudo-random, probabilistic) sampling. Deterministic sampling involves the risk of biasing the results if a periodic repetition occurs in the traffic that exactly matches the sampling rate or a multiple of it, as illustrated in Figure 2-9. Biasing reduces the match of the child and parent population and is counterproductive to the objective of an accurate matching between the two. Unfortunately, a periodic repetition of events in the observed traffic might not be known in advance, so a criterion for a "good" sampling algorithm is to have a good mixture of packets, and that is the starting point to consider random sampling. In other words, the deterministic sampling model is sufficient when the observed traffic does not contain any repetitions, which typically applies at high-speed interfaces. Random sampling is slightly more complex than deterministic sampling, because it implies the generation of the random numbers. However, random sampling increases the probability for the child population to be closer to the parent population, specifically in case of repetitions in the observed traffic. Best practice recommends using random sampling.
A number of research publications address this topic; consequently, it is sufficient to have just a simple example, as illustrated in Figure 2-9. In this case, traffic consists of two flows, and packets from each flow arrive in round-robin order. In the case of deterministic sampling 1-in-4, packets from only one flow are selected for processing, and random sampling "catches" packets from both flows. The reason is that the inverse of the sampling rate (4) is a multiple of the traffic repetition (2).
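The round-robin scenario can be reproduced with a short simulation. This is a sketch under simple assumptions: the two flows are labeled "A" and "B", the helper names are invented for the example, and random sampling is modeled as one random pick per block of four packets.

```python
import random

# Two flows whose packets arrive in strict round-robin order: A, B, A, B, ...
packets = ["A" if i % 2 == 0 else "B" for i in range(1000)]

def deterministic_1_in_n(pkts, n):
    # Select every nth packet (positions n, 2n, 3n, ... counted from 1).
    return pkts[n - 1::n]

def random_1_in_n(pkts, n, rng):
    # Select one packet at a random position within each block of n packets.
    return [pkts[start + rng.randrange(n)]
            for start in range(0, len(pkts) - n + 1, n)]

rng = random.Random(42)  # fixed seed so that the run is repeatable
det = deterministic_1_in_n(packets, 4)
rnd = random_1_in_n(packets, 4, rng)
print("deterministic sees flows:", set(det))
print("random sees flows:", set(rnd))
```

The deterministic sampler always lands on an even position counted from 1, so it only ever observes flow B, whereas the random sampler observes both flows.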
Another concept is packet sampling versus flow sampling, which applies for both random and systematic sampling:
Packet sampling selects packets according to an algorithm and may combine multiple packets into a flow. In this case, flows are created based on the subset of packets that were collected from the sampling instance. In other words, packet sampling occurs first and is optionally followed by flow aggregation.
Flow sampling takes a different approach. It starts with a selection of all packets and applies algorithms to merge them into flows, which results in a full collection of the original traffic (or parent population). Afterwards, flow entries in the cache are sampled either randomly or systematically, based on criteria such as largest flows, shortest flows, flow duration, and so on. In other words, aggregating packets into flows happens first, followed by flow sampling.
Even though packet and flow sampling are described separately, both techniques can be applied in conjunction. For example, you could sample packets first and then sample the aggregated flows afterwards to export only a subset of the total number of flows.
The first algorithm to examine is deterministic sampling, also known as periodic or systematic sampling. This sampling algorithm can be systematic count-based (for example, sample every 100th packet), systematic time-based (such as sample every 10 ms), or systematic size-based (select only packets whose length meets a certain criterion, such as 100 bytes).
These schemes are easy to implement and sufficient for applications such as performance management that require less accuracy than applications such as billing, which have high accuracy requirements. A valid concern related to the systematic approach is the dynamic nature of the traffic, which for a given confidence interval may result in inaccurate undersampling or excessive oversampling under changing traffic conditions. In general, the higher you sample, the better the results are, but there is no need for overachieving the confidence interval that you defined originally. Unfortunately, no mathematical model exists to describe deterministic sampling, which means that there is no mathematical proof that deterministic sampling is not biased. Empirical observations have shown that in high-speed network environments, the traffic is sufficiently mixed so that no repetitions of any kind exist. Consequently, there is no risk of biasing by always selecting the same type of traffic in the parent population, if by chance the sampling rate is a multiple of the traffic repetition rate. However, to be on the safe side, you should select random sampling techniques for applications such as usage-based billing.
1 in N packet sampling, also known as periodic fixed-interval sampling, is a relatively simple count-based algorithm that allows the selection of one packet out of every N packets. You configure a value for N in the meter. Then you multiply the volume of the accounting records at the collection server by the same factor, N, to get the estimated total traffic volume. This is useful for network baselining and traffic analysis, even though the accuracy cannot be determined exactly and the results might be biased.
Example: N = 100
Result: sample packets 100, 200, 300, 400, 500, ...
Effective sampling rate: 1 percent
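The following sketch shows the same 1 in N selection together with the scaling step at the collection server. The packet sizes are made-up values from a seeded random generator, and the function name `sample_every_nth` is an assumption for this illustration only.

```python
import random

def sample_every_nth(packet_sizes, n):
    """Return the sizes of packets n, 2n, 3n, ... (positions counted from 1)."""
    return packet_sizes[n - 1::n]

N = 100
rng = random.Random(1)  # fixed seed; the traffic is illustrative only
packet_sizes = [rng.choice([64, 512, 1500]) for _ in range(10_000)]

sampled = sample_every_nth(packet_sizes, N)
estimated_bytes = sum(sampled) * N  # multiply by the same factor N
actual_bytes = sum(packet_sizes)
print(len(sampled), estimated_bytes, actual_bytes)
```

The estimate and the actual volume generally differ, which reflects the point made above: the result is an estimate whose accuracy is not known exactly.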
Note that NetFlow supports deterministic packet sampling, but it calls the feature "systematic packet sampling."
Note
For more details on Sampled NetFlow, refer to http://www.cisco.com/univercd/cc/td/doc/product/software/ios120/120newft/120limit/120s/120s11/12s_sanf.htm.
The 1 in N packet sampling scheme can be extended to collect multiple adjacent packets at each collection interval. In this case, the interval length defines the total number of packets sampled per interval, while the trigger for the operation is still counter-based. Collecting a number of contiguous packets increases the probability of collecting more than one packet of a given flow. Two parameters define the operation:
The packet-interval parameter is the denominator of the ratio (1/N) of packets sampled. For instance, setting a packet interval of 100, one packet out of every 100 will be sampled.
The interval-length statement defines the number of samples following the initial trigger event, such as collecting the following three packets.
Example: packet-interval = 100, interval-length = 3
Result: sample packets 100, 101, 102, 200, 201, 202, 300, 301, 302, ...
Effective sampling rate: 3 percent
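The two parameters can be modeled as follows. This is a sketch of the selection logic only; the function name `sampled_positions` is hypothetical and packet positions are counted from 1, as in the example above.

```python
def sampled_positions(total_packets, packet_interval, interval_length):
    """List the positions selected by counter-triggered sampling of adjacent packets."""
    positions = []
    # The counter triggers at packet_interval, 2 * packet_interval, ...
    for trigger in range(packet_interval, total_packets + 1, packet_interval):
        # Collect interval_length adjacent packets starting at the trigger.
        for offset in range(interval_length):
            if trigger + offset <= total_packets:
                positions.append(trigger + offset)
    return positions

total = 100_000
positions = sampled_positions(total, packet_interval=100, interval_length=3)
print(positions[:9])
print(len(positions) / total)
```

Over a long packet stream, the fraction of selected packets approaches interval-length divided by packet-interval, which is the 3 percent stated above.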
The schemes described so far use the packet position (also known as "spatial") as the trigger to start the sampling process. Alternatively, a trigger can be a clock or timer, which initiates the selection in time intervals, such as every 100 ms. The stop trigger can also be a timer function, such as one that collects all traffic during an interval of 5 ms. Because you cannot determine in advance how much traffic occurs at the meter during the measurement time interval, three situations are possible:
The accuracy of the child population matches the defined confidence interval; in this case, the sampling rate is correct.
The accuracy of the child population is lower than required, which means that additional samples would be necessary to match the confidence interval. Undersampling describes the situation in which not enough samples are available to offer the required accuracy.
The accuracy of the child population is higher than required, which means that more samples are selected than needed to match the confidence interval. Oversampling describes the situation in which a smaller number of samples would still provide a correct result.
Figure 2-10 illustrates this effect. The solid bars represent required sampling, and the open bars show oversampling. Undersampling is illustrated by the encircled arrows.
In the specific example of Figure 2-10, traffic is considered as flows or unidirectional conversations, which have a defined start and stop time, and the goal is to identify the occurrence of each individual flow. Packets are collected at fixed time intervals, indicated by the bullets and vertical bar. Note that the figure explains a conceptual sampling scenario; it does not describe how packet sampling is implemented. For example, do not assume that four flows are captured in parallel.
During interval t1 – t2, undersampling occurs because not all flows are collected. The missed ones are encircled by the dotted lines. To solve this problem, you need to increase the sampling rate, that is, sample more frequently. In interval t2 – t3, oversampling takes place, and more traffic is collected than required to gather all flows (redundant sampling points marked with a pattern). In this case, you could decrease the sampling rate without losing a flow. Best practice suggests that the selected sampling rate should be a compromise between these two extremes. To avoid empty collections and to gather at least a minimum number of samples, an alternative approach is to combine the time-based start trigger with a packet counter (N) or a content-based function. In that case, you start the collection every n ms but do not stop until N packets are collected. Afterwards the meter idles for n ms and starts again. Instead of collecting N packets in a row when the trigger starts, the meter can select those N packets by applying some random selection, or even by applying a filter to match certain traffic criteria.
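The combination of a time-based start trigger with a packet counter as the stop condition can be sketched as a small simulation. The arrival times are invented (one packet every 7 ms), and `time_triggered_sample`, `period_ms`, and `n_packets` are names chosen for this example rather than parameters of any real meter.

```python
def time_triggered_sample(arrivals_ms, period_ms, n_packets):
    """Start a burst when the timer expires; stop only after n_packets packets."""
    selected = []
    next_trigger = period_ms
    remaining = 0  # packets still to collect in the current burst
    for t in arrivals_ms:  # arrival timestamps in ascending order
        if remaining == 0 and t >= next_trigger:
            remaining = n_packets  # timer expired: start a new burst
        if remaining > 0:
            selected.append(t)
            remaining -= 1
            if remaining == 0:
                # Burst complete: idle for one period before the next trigger.
                next_trigger = t + period_ms
    return selected

arrivals = list(range(0, 1000, 7))  # made-up traffic, one packet every 7 ms
picked = time_triggered_sample(arrivals, period_ms=100, n_packets=3)
print(picked[:6])
```

Because a burst ends only when the counter is satisfied, no collection interval comes back empty as long as traffic is present, regardless of how bursty the arrivals are.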
In the example in Figure 2-10, it is possible to identify over- and undersampling and therefore define the "right" sampling interval. In reality, there is no perfect answer for how to derive an appropriate sampling rate from a given confidence interval, or how to compute an appropriate confidence interval for a given application. This makes it hard to identify the "right" sampling rate.
A different deterministic approach is to collect packets or flows based on their size. From a monitoring perspective, you can be interested in analyzing the traffic's packet size to draw conclusions from the packet size for specific applications, such as security monitoring, or for general planning purposes.
Instead of reporting the exact size per packet, a simple aggregation approach is to define buckets of packet size at the device and aggregate packets into these buckets. This aggregation provides statistics about the packet size distribution but does not supply additional traffic details, such as source or destination. A benefit of this method is the simplicity, because only five collection buckets need to be implemented. Here's an example of the different packet size buckets:
Packet size < 64 bytes
Packet size between 64 and 200 bytes
Packet size between 201 and 500 bytes
Packet size between 501 and 1000 bytes
Packet size > 1000 bytes
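Aggregating packets into these buckets amounts to a small histogram. The sketch below uses the bucket boundaries listed above; the label strings and the function name `bucket_histogram` are assumptions for the illustration.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first four buckets; the fifth is open-ended.
UPPER_BOUNDS = [63, 200, 500, 1000]
LABELS = ["<64", "64-200", "201-500", "501-1000", ">1000"]

def bucket_histogram(packet_sizes):
    """Count packets per size bucket; no source or destination is retained."""
    counts = [0] * len(LABELS)
    for size in packet_sizes:
        # bisect_left returns the index of the first bound that is >= size.
        counts[bisect_left(UPPER_BOUNDS, size)] += 1
    return dict(zip(LABELS, counts))

histogram = bucket_histogram([40, 64, 200, 201, 1000, 1500, 1500])
print(histogram)
```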
The size-based sampling concept can also be applied to the size of a flow, which can be the total number of packets for a flow or the total number of bytes for a flow.
Previously, we defined a flow as an aggregation of packets with common key fields, such as source or destination IP address. The flow size is indeed a relevant indicator of communication characteristics in the network, as you will see in the following examples. A huge number of very small packets can indicate a security attack. Instant-messaging communication usually generates a constant number of small packets; the same applies for IP telephony. File transfer sessions use the maximum packet size and transmit a large volume in a short time period.
Note
An interesting article on the subject of Internet traffic is "Understanding Internet Traffic Streams: Dragonflies and Tortoises" by N. Brownlee and KC Claffy (http://www.caida.org/outreach/papers/2002/Dragonflies/cnit.pdf).
Size-based flow sampling works as follows: during time interval T, all packets are aggregated into flows. At the end of interval T, size-based flow sampling selects only flow records with a large volume for export, either in number of packets or in number of bytes. This method reduces the number of data sets exported from the meter to the collection server and trims the required processing cycles at the meter, because only a subset of entries from the cache are exported. Tables 2-17 and 2-18 describe size-based flow sampling. Both tables contain flow entries; the major difference is the sequence of the flow entries. Table 2-17 shows a flow table, where flows are added based on their creation time. Table 2-18 displays the same entries, but this time sorted by the flow size (number of packets). Consider a function that lets you select the Top-N entries, where you define a flow size threshold (such as a number of packets) and export only those flows above the threshold. For Table 2-18, a threshold of 10,000 packets would result in exporting flow entry numbers 995 and 4.
| Flow Entry | IP Source Address | IP Destination Address | Packets | TOS | Source AS | Destination AS |
|---|---|---|---|---|---|---|
| 1 | 10.1.1.1 | 10.2.2.2 | 5327 | 160 | 13123 | 5617 |
| 2 | 1.2.3.4 | 1.1.1.1 | 294 | 176 | 13123 | 3320 |
| 3 | 10.61.96.101 | 171.69.2.78 | 15 | 0 | 0 | 5617 |
| 4 | 171.69.2.78 | 10.61.96.101 | 10382 | 0 | 13123 | 3215 |
| 5 | 144.254.71.181 | 10.1.1.3 | 816 | | 13123 | 1668 |
| ... | | | | | | |
| 995 | 10.2.1.5 | 10.64.30.123 | 290675 | 176 | 13123 | 7018 |
| 996 | 144.210.17.180 | 171.71.180.91 | 8410 | 0 | 21344 | 5617 |
| 997 | 10.1.1.1 | 10.2.2.2 | 481 | 160 | 13123 | 22909 |
| ... |
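The threshold-based export can be checked against the packet counts shown in the table. This is a minimal sketch: only the flow entry number and packet count are modeled, and the function name `export_large_flows` is hypothetical.

```python
# Flow entry number -> packet count, taken from the table above.
flows = {1: 5327, 2: 294, 3: 15, 4: 10382, 5: 816,
         995: 290675, 996: 8410, 997: 481}

def export_large_flows(flow_packets, threshold):
    """Return (entry, packets) pairs above the threshold, largest flow first."""
    selected = [(entry, pkts) for entry, pkts in flow_packets.items()
                if pkts > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)

exported = export_large_flows(flows, 10_000)
print(exported)  # [(995, 290675), (4, 10382)]
```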
A valid concern of deterministic sampling is the potential biasing of the collection results. Related to sampling, the objective for the child population is to represent the parent population accurately. The term "random sampling" implies that for a finite quantity each member has the same chance of being selected. The random samples should be representative of the entire parent population. Algorithms that meet this requirement are also called pseudo-random-number generators. From a mathematical perspective, random sampling can be represented in a model, whereas deterministic sampling can only be investigated in an empirical manner. This is the main reason for the selection of random sampling versus deterministic sampling in all situations where determination of the accuracy is relevant.
In random 1-out-of-N sampling, an average of one out of N sequential packets is randomly selected for processing (where N is a user-configurable integer parameter), ultimately providing information on 100/N percent of the total traffic. The notation 1:N is used to describe sampling of 1-out-of-N packets. For example, if a random sampling of 1:100 is configured, the algorithm randomly selects one packet out of each 100 sequential packets, providing information on 1 percent of the total traffic. Figure 2-11 illustrates 1-out-of-N sampling with a sampling interval of N = 5, so a random selection of one packet within a set of five packets takes place. Cisco NetFlow implements random sampling; it is called Random Sampled NetFlow.
Note
For more details on Random Sampled NetFlow, check http://www.cisco.com/en/US/products/sw/iosswrel/ps5207/products_feature_guide09186a00801a7618.html.
A modified version of the 1-out-of-N sampling is n-out-of-N mode. It is similar to random 1-out-of-N sampling just described, except this time, not just a single packet gets collected, but several packets. If you configure n = 100 and N = 10,000, you randomly sample 100 nonconsecutive packets at random positions within the 10,000 packets.
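Both random modes can be expressed with one helper that picks n distinct random positions inside each window of N sequential packets. This is a sketch of the selection principle, not of the NetFlow implementation; the function name and the zero-based positions are assumptions for the example.

```python
import random

def random_n_out_of_N(total_packets, n, N, rng):
    """Pick n distinct random positions in each window of N sequential packets."""
    positions = []
    for start in range(0, total_packets - N + 1, N):
        # rng.sample draws without replacement, so positions never repeat.
        positions.extend(sorted(start + p for p in rng.sample(range(N), n)))
    return positions

rng = random.Random(7)  # fixed seed so that the run is repeatable
one_in_100 = random_n_out_of_N(10_000, n=1, N=100, rng=rng)
hundred_in_10k = random_n_out_of_N(10_000, n=100, N=10_000, rng=rng)
print(len(one_in_100), len(hundred_in_10k))
```

Both calls select 100 of the 10,000 packets. The difference is the spread: 1:100 guarantees exactly one sample per block of 100 packets, whereas 100-out-of-10,000 lets the samples fall anywhere within the larger window.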
In contrast to random packet sampling, where the random factor applies to the packet selection, random flow sampling takes a different approach. The meter aggregates the observed traffic into flows, based on the defined key aggregation fields, such as source or destination address, port number, or device interface. This aggregation step can be applied to all packets, in case of a full collection, or can be applied after packet sampling. Random flow sampling is accomplished afterwards, where not all flows are exported, but only a subset of flows, based on the random factor. Instead of defining a random factor for packets, you define a factor for 1-out-of-N flows. Figure 2-12 illustrates the four steps of random flow sampling:
1. Collect every packet or alternatively sample packets first.
2. Aggregate packets into flows, and create entries in the flow table.
3. Randomly select a number of flow entries from the table.
4. Export only the selected flow entries, and clear the table.
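The four steps can be sketched as follows. The example assumes a full collection in step 1, uses the (source, destination) pair as the only flow key, and exports roughly one out of every N flow entries; all names are chosen for the illustration.

```python
import random
from collections import Counter

def random_flow_sampling(packets, one_in_n, rng):
    """Aggregate packets into flows, then export a random subset of the flows."""
    # Steps 1 and 2: collect every packet and aggregate by (src, dst) key.
    flow_table = Counter((src, dst) for src, dst in packets)
    # Step 3: randomly select about 1-out-of-N flow entries (at least one).
    k = max(1, len(flow_table) // one_in_n)
    chosen = rng.sample(sorted(flow_table), k)
    # Step 4: export only the selected entries and clear the table.
    exported = {key: flow_table[key] for key in chosen}
    flow_table.clear()
    return exported

# Made-up traffic: 20 flows, where flow i consists of i + 1 packets.
packets = [(f"10.0.0.{i}", "10.9.9.9") for i in range(20) for _ in range(i + 1)]
exported = random_flow_sampling(packets, one_in_n=4, rng=random.Random(5))
print(exported)
```

Each exported entry carries the full packet count of its flow, because aggregation happens before sampling.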
Probabilistic sampling describes a method in which the likelihood of an element's selection is defined in advance. For example, if you toss a coin and select a packet only if the coin shows heads, the selection chance is 1 out of 2. If you cast a die and select a packet if one dot is displayed, chances are 1 out of 6 that a packet will get chosen. Probabilistic sampling can be further divided into a uniform and nonuniform version.
Uniform probabilistic sampling uses a random selection process, as described with the coin and dice examples, and is independent of the packet's content. Flow sampling shows where a uniform selection falls short: most of the time you want to export the flows with a high volume because these are the most important ones. The solution is to export large flow records with a high probability and small flow records with a low probability, for example proportional to the flow record volume, which makes the selection probability nonuniform.
Nonuniform probabilistic sampling does not use a random function for packet selection; instead, it uses a function based on the packet position or packet content. The idea behind it is to weight the sampling probabilities to increase the likelihood of collecting rare but relevant packets. Imagine that you want to select routing protocol updates to identify changes to the paths in your network. Compared to user traffic, these packets represent a minority of the total traffic but are important to meet your objective.
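The volume-weighted export of flow records described above, where the export probability grows with the flow volume, can be sketched as follows. The threshold `z` is a hypothetical tuning parameter introduced for this example: flows of at least `z` bytes are always exported, and smaller flows are exported with probability proportional to their volume.

```python
import random

def export_probability(flow_bytes, z):
    """Probability proportional to the flow volume, capped at 1 for flows >= z."""
    return min(1.0, flow_bytes / z)

def sample_flows(flows, z, rng):
    """Export each flow record with its volume-dependent probability."""
    return {flow_id: size for flow_id, size in flows.items()
            if rng.random() < export_probability(size, z)}

flows = {"a": 2_000_000, "b": 1_000_000, "c": 500, "d": 40}  # bytes, made up
exported = sample_flows(flows, z=1_000_000, rng=random.Random(11))
print(sorted(exported))
```

The large flows "a" and "b" are exported on every run, whereas "c" and "d" appear only rarely.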
For the sake of completeness, the theoretical aspects of stratified sampling are highlighted next. Stratified sampling takes the variations of the parent population into account and applies a grouping function before applying sampling. Stratification is the method of grouping members from the parent population with common criteria into homogeneous subgroups first; these groups are called strata. The benefit is that a lower sampling rate per stratum is sufficient to achieve the same level of accuracy. For example, if a sampling rate of 1 out of 10 is required to achieve a certain confidence interval, after grouping by strata, the same goal could be achieved by a sampling rate of 1 out of 20. The key to successful stratification is to find a criterion that will return a stratification gain.
Two requirements are relevant for the selection process:
Comprehensiveness— Every element gets selected; none can be excluded.
Mutual exclusiveness— Every element has to be assigned to exactly one group (stratum).
Referring to Figure 2-7, child populations A, B, C, and D are taken from the parent population and are grouped according to their characteristics. After the packets are grouped, sampling techniques are performed on each stratum individually, which means that different sampling algorithms can be applied in parallel. Stratification also achieves the same confidence interval with a lower sampling rate.
A practical illustration is first to classify traffic per application (such as HTTP, FTP, Telnet, peer-to-peer, and management traffic) and then sample per group (stratum). This method is useful to correct the allocation of variances in the parent population.
For example, the volume of web-based traffic on a link is 10 times the amount of Telnet traffic. Assuming that you want to sample packets, the child population should contain the same volume of HTTP and Telnet packets, possibly for packet content analysis. If you apply sampling across the mixed traffic, a higher sampling rate is required to select enough Telnet packets, due to their small occurrence, while a lower sampling rate would be sufficient for HTTP. If you group (stratify) the traffic first into a stratum of HTTP packets and then into a stratum of Telnet packets, the same sampling rate can be applied to both groups.
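Grouping first and sampling per stratum can be sketched as follows. The packet representation (application, payload) and the per-stratum 1-in-N values are hypothetical; the sampling setting is kept as an explicit parameter per stratum so that the effect of different choices can be explored.

```python
import random
from collections import defaultdict

def stratified_sample(packets, rates, rng):
    """packets: (application, payload) pairs; rates: application -> N for 1-in-N."""
    strata = defaultdict(list)
    for app, payload in packets:
        # Every packet is assigned to exactly one stratum.
        strata[app].append(payload)
    sample = {}
    for app, members in strata.items():
        k = max(1, len(members) // rates[app])
        sample[app] = rng.sample(members, k)  # random selection within the stratum
    return sample

# Made-up traffic: ten times more HTTP packets than Telnet packets.
packets = [("HTTP", i) for i in range(1000)] + [("Telnet", i) for i in range(100)]
sample = stratified_sample(packets, {"HTTP": 100, "Telnet": 10}, random.Random(2))
print({app: len(chosen) for app, chosen in sample.items()})
```

With these illustrative settings, both strata contribute ten packets to the child population even though the Telnet traffic is only a tenth of the HTTP volume.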
Filtering is another method to reduce the number of collection records at the meter. Filters are deterministic operations performed on the packet content, such as match/mask to identify packets for collection. This implies that the packet selection is never based on a criterion such as packet position (time or sequence) or a random process in the first place.
Three steps are applied for filtering. As the first step, you define "interesting" packets, which are the selection criterion for the collection process. One example is to filter packets based on selected IP or MPLS fields; another example is filtering based on the packet's QoS parameters. A final example is the matching of IPv4 and IPv6 header types that provide the operator with adequate information during the transition phase from IPv4 to IPv6. A practical implementation for selecting packets is the use of Access Control List (ACL) match statements. Step 2 selects either full packet collection or sampling operations. Step 3 exports packets immediately or aggregates them into flows before exporting. Figure 2-13 shows the various alternatives and combinations of filtering and sampling.
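Step 1, the definition of "interesting" packets, can be illustrated with a simple match on header fields. This sketch only mimics the idea of an ACL match statement: packets are modeled as dictionaries, and `make_filter` with its `dst_prefix` and `tos` arguments is an assumption for the example, not a router command.

```python
import ipaddress

def make_filter(dst_prefix, tos=None):
    """Build a deterministic filter on destination prefix and optional TOS value."""
    network = ipaddress.ip_network(dst_prefix)

    def matches(packet):
        if ipaddress.ip_address(packet["dst"]) not in network:
            return False
        return tos is None or packet["tos"] == tos

    return matches

interesting = make_filter("10.2.0.0/16", tos=160)
packets = [
    {"dst": "10.2.2.2", "tos": 160},
    {"dst": "10.2.2.2", "tos": 0},
    {"dst": "171.69.2.78", "tos": 160},
]
selected = [p for p in packets if interesting(p)]
print(selected)
```

The same packet always produces the same decision, which is what distinguishes filtering from sampling. The selected packets can then be passed to a full collection or to a sampling function as step 2.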
The combination of filtering and sampling is a very efficient approach to dealing with the increasing traffic volume in the networks. Instead of choosing between full collection and sampling, you can apply the preferred methodology based on traffic types. If a network already has service classes defined, these classes can act as the traffic distinguisher for collecting accounting and performance data records. Figure 2-14 shows three different traffic classes:
Priority traffic— A network operator requires detailed accounting records from the priority traffic for billing purposes, so a full collection is configured.
Business traffic— Business traffic needs to be monitored closely to validate SLAs, but it is not charged, so a sampling rate of 100 is acceptable.
Best-effort traffic— The best-effort traffic class is provided without any SLA; therefore, a basic collection with a sampling rate of 1000 is adequate.
These requirements can be fulfilled by deploying a combination of filtering and sampling.
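A minimal Python sketch of such a per-class policy is shown next. It assumes that "a sampling rate of 100" means one packet selected out of 100, that the class of each packet is already known (for example, from its QoS marking), and that systematic count-based selection is acceptable; the names are hypothetical.

```python
import itertools

# 1 means full collection; N means "select one packet out of N".
SAMPLING_RATE = {"priority": 1, "business": 100, "best-effort": 1000}

def make_meter(rates):
    """Return a function that decides, per traffic class, whether to collect."""
    counters = {cls: itertools.count(1) for cls in rates}

    def should_collect(traffic_class: str) -> bool:
        n = next(counters[traffic_class])          # packets seen in this class
        return n % rates[traffic_class] == 0       # every N-th packet is kept

    return should_collect

should_collect = make_meter(SAMPLING_RATE)
collected = {
    cls: sum(should_collect(cls) for _ in range(10_000))
    for cls in SAMPLING_RATE
}
print(collected)
```

For 10,000 packets per class, the sketch collects every priority packet, 100 business packets, and 10 best-effort packets.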
A more sophisticated design is to collect all packets continuously under normal circumstances and to apply sampling and filtering only during "special" situations, such as very high utilization or security attacks. During a DoS attack, it is critical to collect detailed traces of the specific portion of the traffic that is related to the attack, for example, when a significant fraction of the traffic has the same destination address.
Note that NetFlow supports input filtering mechanisms, under the name "input filters."
The filters described so far act on the packet content. An alternative is to filter based on the router's state. For instance, a violated ACL can trigger a collection of flows for a security analysis. Alternatively, traffic that matches a certain BGP AS number or range of AS numbers can be metered, while traffic from other AS numbers is not metered.
Whereas passive collection methods are based on the concept of not affecting the live traffic on the network, active monitoring applies the exact opposite paradigm. Specific synthetic traffic is generated and the results are collected to indirectly measure the performance of a device, the network, or a service. This section describes how to generate and meter synthetic traffic.
Certain conditions are to be met when actively measuring the network and creating test traffic:
The characteristics of the user traffic must be represented, such as packet size and QoS parameters.
The ratio of test traffic compared to the total capacity is required to be relatively low. Best practices suggest a maximum of 1 percent test traffic compared to production traffic.
Test traffic must not be blocked by security instances, such as ACLs, firewalls, or proxies.
Devices must treat the test traffic exactly like any other traffic (for example, a router must not process test traffic in software if other traffic is processed in hardware).
The start time of the operations should include a random component to avoid biased results. (For example, you can define an operation to occur every 30 seconds, but a Poisson process would actually start each operation at an interval that varies by ±5 percent.)
Operations should support excessive short-term operations to support troubleshooting as well as low-impact long-term operations for trending purposes.
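The random start-time condition can be sketched in Python as shown next. This is a simplification under stated assumptions: it draws a uniform jitter of ±5 percent around the 30-second interval rather than modeling a true Poisson process (which would use exponentially distributed intervals), and the function name is hypothetical.

```python
import random

def jittered_intervals(base_s: float, jitter: float, count: int, rng: random.Random):
    """Return `count` start-to-start intervals of base_s +/- jitter (a fraction)."""
    return [base_s * (1 + rng.uniform(-jitter, jitter)) for _ in range(count)]

rng = random.Random(42)  # seeded only to make the example repeatable
intervals = jittered_intervals(30.0, 0.05, 5, rng)
print([round(i, 2) for i in intervals])
```

Every interval falls between 28.5 and 31.5 seconds, so successive operations do not lock onto a periodic pattern in the network.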
There are various ways to generate synthetic traffic, but before we investigate them, a fundamental decision is required: Where should availability be measured?
At the device?
At the network?
At the service?
You can start at the device level to ensure that the device still exists and responds to requests. Monitoring a device in an isolated approach has limited value, so the measurements are extended to include the network level by checking end-to-end availability. Sometimes this might not be sufficient, so the service is monitored in addition to devices and the network. There is no single answer to the question of where to monitor availability, because it depends on the purpose of the measurements. If you want to generate service-level reports, the right level to focus on is the service availability, including mean time to restore (MTTR). For troubleshooting purposes, network and device availability statistics are very beneficial.
The following statements outline the fundamentals of synthetic measurement:
The measurement is performed indirectly by inserting test traffic. This means that records are not collected from the live traffic on the network and there is no direct relationship between the live traffic and the test traffic, except that network elements should treat both traffic types the same and they should traverse the same paths.
Injecting additional test traffic can cause a dilemma. Imagine a situation in which the production traffic is just below the defined threshold and the extra traffic hits the utilization threshold and raises an alarm, even though the network was performing well before. Although this is a valid argument against synthetic measurement traffic, the same could happen with user traffic, when just one additional session creates enough traffic to start an alarm.
There is a performance impact on the measured devices, such as routers, switches, probes, or agents running on the client or server, that should not be neglected. Best practice suggests continuous monitoring of the device CPU utilization.
The simplest approach to test device availability is to send a test packet to the device and watch the result. Even though this sounds almost primitive, it is exactly what the most-used network management tool does: ping (the correct name is ICMP echo) tests a device's availability. Ping provides a set of useful statistics about a device and the network layer that connects the device to the network. Besides the limitations, such as testing only the network interface and related drivers plus parts of the operating system, the outcome can contribute valuable information, especially when these tests are performed continuously, resulting in general device availability reports and statistics. More advanced tests, such as Cisco IP SLA, also measure the device processing time by adding time stamps. Best practice suggests monitoring device availability continuously, preferably not only with a ping test, but with more advanced functions, such as SNMP monitoring of the sysUptime MIB parameter. This reports information on how long the device was operational since the last reboot.
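As a sketch of how polled sysUpTime values can be interpreted, consider the following Python fragment. It assumes the values were already retrieved by an SNMP poller (not shown) and uses hypothetical sample data. sysUpTime is expressed in TimeTicks, which are hundredths of a second. Note that the 32-bit counter also wraps after roughly 497 days, which this simple check would misread as a restart.

```python
from datetime import timedelta

def uptime(sys_uptime_ticks: int) -> timedelta:
    """Convert sysUpTime TimeTicks (hundredths of a second) to a timedelta."""
    return timedelta(seconds=sys_uptime_ticks / 100)

def restarts(samples):
    """Count polls where sysUpTime went backwards compared to the previous poll."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)

polled = [8_640_000, 8_643_000, 1_500, 4_500]  # hypothetical values from four polls
print(uptime(polled[0]), restarts(polled))
```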
Network availability takes a holistic approach to monitor the network as a system and not just check individual components. Proactive measurement operations include generating synthetic traffic from one end of the network to the other end—but first the term "end" must be defined. For a server operator it would probably mean the client on one side of the network and the server on the other end. A network administrator considers the test between two network edge devices end-to-end.
Ping can help measure network availability. If a ping test reports slow response time, it indicates a general network problem, which usually affects other applications as well. Unfortunately, the reverse assumption is not always correct: Even if ping test results are OK, there might be a severe problem in the network. Consider another case: Just because a server can ping router 1 and router 2, this does not imply that router 1 and router 2 can communicate. This scenario confirms that active probing between the network elements is required, such as by utilizing the PING-MIB or Cisco IP SLA.
Ping is an example of a round-trip measurement. The sender generates a test traffic packet and sends it to the receiver, which marks the packet as read and returns it to the sender. The sender has a timer to measure the total traveling time of the packet, while the receiver only echoes the packets to the sender. Another tool, traceroute, also builds on top of ICMP. It provides more detailed results than ping, such as the round-trip time (RTT) for every hop in the path. This helps identify the slowest link in the path. See the section "Active Monitoring Technologies and Tools: ping, traceroute, and IP SLA" later in this chapter for an in-depth explanation. The general assumption behind round-trip measurement is that the forward and return traffic uses the same paths through the network.
In case of load balancing, where two paths operate in parallel, you need to measure all possible paths. There are multiple types: load balancing per packet, load balancing per destination, load balancing based on the combination of source and destination IP addresses, etc. The load-sharing configuration of all routers along the path is important. In case of per-destination load sharing, all packets for the same destination take the same path. In case of load balancing based on the combination of source and destination IP addresses, all packets from the same flow (defined by IP addresses) take the same path. In case of per-packet load sharing, all possible paths are taken and are measured separately as a consequence. The latter case might cause peaks of RTT: one per path. The described actions apply for ping, IP SLA, and traceroute. Best practice suggests avoiding per-packet load balancing when time measurement is involved.
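The effect of load balancing on source and destination addresses can be illustrated with a small Python sketch. The hash below (CRC32 over the address pair) is only a stand-in chosen for this example; it is not the algorithm of any particular router, and the addresses are documentation prefixes.

```python
import zlib

def pick_path(src: str, dst: str, n_paths: int) -> int:
    """Map a source/destination pair to one of n_paths parallel links."""
    return zlib.crc32(f"{src}>{dst}".encode()) % n_paths

# A probe with fixed addresses always lands on the same path ...
fixed = {pick_path("192.0.2.1", "198.51.100.7", 2) for _ in range(100)}
# ... so covering every path means varying an address (here: the source).
varied = {pick_path(f"192.0.2.{i}", "198.51.100.7", 2) for i in range(1, 33)}
print(len(fixed), sorted(varied))
```

The first set always contains exactly one path index. Whether the varied probes reach every path depends on the hash, which is why the load-sharing configuration of all routers along the path matters when you plan the measurement.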
One-way measurement is an appropriate measuring method for asymmetric routing, load balancing, and increased report granularity. Figure 2-15 shows three different scenarios:
A is the symmetric design. In this case, round-trip measurements are fine, but one-way operation provides the time measurement per direction.
B illustrates the symmetric design with load balancing. A round-trip operation is still OK. However, as explained in the case of load balancing based on a combination of source and destination IP addresses (which is the most deployed load-balancing scheme), only one path is measured by the flow if the IP addresses of the generated packets don't change.
C exemplifies an asymmetric design with a distinct path from source to destination and a different path for the return traffic. Under normal circumstances, an RTT measurement provides sufficient details, because RTT returns the asymmetric path a normal packet would take (R1 → R2 → R3 → R4 → R1). If the RTT results exceed a threshold and the operator wants to identify on which path the bottleneck occurs, two one-way measurement operations are required: R1 → R2 → R3 for the forwarding path and R3 → R4 → R1 for the return path.
One-way measurement increases the measurement's level of detail, because it provides separate statistics for the forward and return traffic. In the symmetric design (A) in Figure 2-15, assume that you have defined an SLA with an RTT of 20 ms between R1 and R3. Suddenly, RTT values rise above 50 ms because someone configured one interface of the middle router (R2) with wrong QoS parameters, which delays the forwarding of packets. Round-trip measurement cannot determine where the delay occurs, but a one-way measurement would identify in which direction the delay occurs. For advanced troubleshooting, the operator can configure additional one-way measurements for R1 → R2 and R2 → R3 to get a detailed picture per hop.
For round-trip measurements, the same device sends and receives the generated traffic; therefore, the absolute time is not relevant for the results. For one-way operations, two different devices need to cooperate, because one device generates the test packets, and a different device receives them and calculates the result. This requires synchronized system clocks on both devices; otherwise, the results are meaningless! Accurate timing is an important requirement in one-way measurement. This can be achieved by connecting Global Positioning System (GPS) receivers to the network elements or by configuring the network time protocol (NTP) in the network.
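The following Python sketch shows why clock synchronization matters for one-way results but not for round-trip results. It is a simplified model with hypothetical values in milliseconds: the sender clock is the reference, the receiver clock is off by a fixed amount, and the reply is sent without any processing delay.

```python
def measured_delays(true_fwd, true_rev, receiver_clock_offset):
    """Return (forward, reverse, round-trip) as computed from raw time stamps."""
    t_send = 1000                                        # sender clock
    t_recv = t_send + true_fwd + receiver_clock_offset   # receiver clock
    t_reply = t_recv                                     # reply leaves at once
    t_back = t_send + true_fwd + true_rev                # sender clock again
    forward = t_recv - t_send      # mixes two clocks: includes the offset
    reverse = t_back - t_reply     # mixes two clocks: offset with opposite sign
    round_trip = t_back - t_send   # one clock only: offset cancels out
    return forward, reverse, round_trip

print(measured_delays(10, 12, 0))    # synchronized clocks
print(measured_delays(10, 12, 50))   # receiver clock 50 ms ahead
```

With a 50-ms offset, the forward delay appears as 60 ms and the reverse delay even becomes negative, while the round-trip time stays at 22 ms in both cases.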
Note
For more information about NTP, refer to the following white paper: http://www.cisco.com/en/US/tech/tk869/tk769/technologies_white_paper09186a0080117070.shtml.
The previous traffic generation examples addressed the network. Now the focus shifts toward services offered on top of it. Examples of common network services are Dynamic Host Configuration Protocol (DHCP) and DNS. A DHCP server supplies IP addresses to the clients, so you meter the time it takes to fulfill a DHCP request. To monitor the availability of an IP telephony service, you can implement test software that emulates a virtual phone that generates and receives test calls to check the telephony server. An alternative approach is to install a dummy phone in the wiring closet, connect it to the same switch that serves other IP phones, and perform automated test operations at this phone (such as registering the phone at the server and sending calls to the phone). You see immediately that the second approach is much closer to reality, because it tests the server as well as the infrastructure, including the switch that provides inline power to IP phones. The same applies for testing network services. You can perform a ping test to monitor the availability of your central web server. But to be assured that the web server is operational, you need to send an HTTP request and measure how long it takes to succeed. A similar example is the DNS service: a simple ping would prove that the server is operational and connected to the network. It does not tell you anything about the DNS service, so a DNS query and response is necessary to prove the DNS service operation.
Besides DNS and DHCP, examples of synthetic service operations are
HTTP website download
TCP connect (how long it takes to establish a TCP connection)
FTP/TFTP download
Database operations (insert a record, retrieve it, delete it)
IP telephony tests (register and unregister a phone, generate a call)
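As one example from this list, a TCP connect operation can be sketched in a few lines of Python. The function name and the timeout value are assumptions; to stay self-contained, the demonstration probes a listener that it opens itself on the loopback interface instead of a real server.

```python
import socket
import time

def tcp_connect_time(host: str, port: int, timeout: float = 3.0):
    """Return seconds needed to establish a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None  # refused, unreachable, or timed out: treat as unavailable

# Self-contained demo: probe a listener opened on the loopback interface.
with socket.socket() as listener:
    listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    print(tcp_connect_time("127.0.0.1", port))
```

A successful connect proves only that the server accepts connections on that port; an HTTP request or DNS query on top of it is still needed to verify the service itself.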
The demarcation line between the network and server components can be achieved by implementing time stamps. The first time stamp is applied immediately after arrival at the device input interface. Another one can be applied before sending the packet to the device output queue, and the final one immediately before putting the packet on the wire.
Figure 2-16 illustrates the time stamp concept:
TS0 (time stamp 0) is the initial time stamp, created when the packet is sent at the source router.
TS1 is the time stamp when the packet arrives at the destination router's ingress interface.
TS2 is the time stamp when the packet is returned at the destination router's egress interface.
TS3 is the time stamp when the packet is received at the source router's ingress interface.
TS4 is the final time stamp when the packet is received at the source measurement function.
The following results can be analyzed afterwards:
TNetwork (S → D) is the time it took the packet to travel through the network from source (S) to destination (D).
TNetwork (D → S) is the time it took the packet to travel through the network from destination to source.
TS2 – TS1 is the processing time at the destination device.
TS4 – TS3 is the processing time at the source device.
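The time stamp arithmetic can be sketched in Python as follows. The two processing times are taken directly from the definitions above. The two network times are computed here as TS1 – TS0 and TS3 – TS2, which is an assumption derived from the time stamp descriptions; because each of them compares time stamps from two different devices, they are meaningful only with synchronized clocks. The sample values are hypothetical and given in milliseconds.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    ts0: float  # packet sent at the source router
    ts1: float  # packet arrives at the destination ingress interface
    ts2: float  # packet returned at the destination egress interface
    ts3: float  # packet received at the source ingress interface
    ts4: float  # packet received at the source measurement function

    def breakdown(self):
        return {
            "network_s_to_d": self.ts1 - self.ts0,     # assumed formula, two clocks
            "dest_processing": self.ts2 - self.ts1,
            "network_d_to_s": self.ts3 - self.ts2,     # assumed formula, two clocks
            "source_processing": self.ts4 - self.ts3,
        }

result = Probe(ts0=0, ts1=40, ts2=42, ts3=80, ts4=81).breakdown()
print(result, sum(result.values()))
```

The four components add up to TS4 – TS0, the total time observed by the source, which is what a plain round-trip measurement would report as a single number.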
A service availability report is more complex than a device availability report. A service can be operational and handle requests, but if the response time suddenly increases drastically, users will certainly declare the service to be unavailable. This leaves the network planner in the situation of predefining response time thresholds per service to identify when they are considered unavailable due to performance issues. At this point, the concept of baselining, introduced in Chapter 1, becomes relevant for service management. To estimate the current quality of a specific service, relating it to the overall long-term performance of this service, as well as to other services and the network quality, is more meaningful than considering isolated statements. Service measurement from the user's perspective should be included in the baselining process.
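A response time threshold can be folded into an availability figure as in the following Python sketch. The threshold, the sample values, and the convention that None marks an unanswered probe are assumptions for illustration; in practice, the threshold would come from the baseline of the specific service.

```python
def service_availability(response_times_ms, threshold_ms):
    """Percentage of probes that were answered within the threshold.

    None marks a probe that got no answer at all.
    """
    ok = sum(
        1 for rt in response_times_ms
        if rt is not None and rt <= threshold_ms
    )
    return 100.0 * ok / len(response_times_ms)

# Hypothetical probe results: one timeout and one very slow answer.
samples = [120, 135, None, 110, 2400, 125, 130, 118, 122, 127]
print(service_availability(samples, threshold_ms=500))
```

Here the service answered nine of ten probes, but only eight within the threshold, so the report shows 80 percent availability from the user's perspective.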
Multiple functions exist to generate synthetic traffic. The best-known and most widely used active measurement tool is certainly the ping test. The correct name is ICMP echo. It consists of a sender and a receiver component that interact in the transaction. The sender generates an ICMP echo request packet toward the destination and starts a timer. The receiver swaps the source and destination addresses in the IP header and returns the packet to the sender as an ICMP echo reply. As soon as the sender receives the response, the timer is stopped, and the elapsed time is displayed. Options exist to run multiple or continuous operations. At the end, some statistics are reported. Ping can be directed to take a specific path through the network, because it supports Loose Source Routing (LSR); however, LSR is disabled most of the time in today's networks. Note that the accuracy of the ping results is limited, because they combine the network response time and the processing time at the sender and receiver in one record. Depending on the implementation specifics of the operating system (OS), significant delay can be added if the OS treats ping requests with low priority. Nevertheless, ping is a very useful diagnostic and troubleshooting tool. Results can be displayed at the command-line interface or retrieved through MIBs (CISCO-PING-MIB or the IETF pingMIB [RFC 2925]).
A sample ping report is as follows:
C:\WINNT>ping www.cisco.com
Pinging www.cisco.com [198.133.219.25] with 32 bytes of data:
Reply from 198.133.219.25: bytes=32 time=240ms TTL=235
Reply from 198.133.219.25: bytes=32 time=340ms TTL=235
Reply from 198.133.219.25: bytes=32 time=601ms TTL=235
Reply from 198.133.219.25: bytes=32 time=231ms TTL=235
Ping statistics for 198.133.219.25:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 231ms, Maximum = 601ms, Average = 353ms
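The summary lines of the report follow directly from the individual replies. As a small worked example, this Python sketch recomputes them from the four reply times shown above; the variable names are arbitrary.

```python
rtts_ms = [240, 340, 601, 231]  # reply times taken from the report above
sent = 4

received = len(rtts_ms)
lost = sent - received
summary = {
    "loss_percent": 100 * lost // sent,
    "min_ms": min(rtts_ms),
    "max_ms": max(rtts_ms),
    "avg_ms": sum(rtts_ms) // received,  # integer division of 1412 by 4
}
print(summary)
```

The result matches the report: 0 percent loss, a minimum of 231 ms, a maximum of 601 ms, and an average of 353 ms. The single 601-ms reply raises the average noticeably, which is one reason to look at more than the average when you interpret ping results.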
Traceroute is probably the second-best-known network management tool; it is also based on ICMP and can be considered an advanced ping. Ping measures only the total round-trip time between source and destination, but traceroute displays the full path and provides statistics such as delay and packet loss on a per-hop basis. This makes it easy to spot a performance bottleneck in the network as well as routing loops or failed devices. Traceroute leverages the time-to-live (TTL) field in the IP header, which is normally used to prevent packets from circling forever during a routing loop. When forwarding packets, each Layer 3 device decreases the TTL counter. When the counter reaches 0, the device discards the packet and sends an ICMP time-exceeded message to the originator. Traceroute uses this function: it generates a series of ping tests, each with an increased TTL value (starting from 1), and starts a separate timer for each packet. The timer stops when the corresponding ICMP time-exceeded message arrives.
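The TTL mechanism can be illustrated with a small Python simulation. It does not send any packets; the path is a hypothetical list of hop names, and the model assumes that every hop answers and that the destination replies as soon as the packet reaches it.

```python
def probe(path, ttl):
    """Forward one packet along `path`; return the hop that answers and why.

    Each hop decrements the TTL; the hop where it reaches 0 reports back.
    """
    for index, hop in enumerate(path):
        ttl -= 1
        if index == len(path) - 1:
            return hop, "reply"           # destination reached
        if ttl == 0:
            return hop, "time-exceeded"   # packet discarded at this hop

def traceroute(path, max_hops=30):
    """Send probes with TTL 1, 2, 3, ... until the destination replies."""
    route = []
    for ttl in range(1, max_hops + 1):
        hop, kind = probe(path, ttl)
        route.append(hop)
        if kind == "reply":
            break
    return route

# Hypothetical three-hop path.
print(traceroute(["router-a", "router-b", "server"]))
```

Each probe reveals exactly one additional hop, which is why the report lists one line per TTL value.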
Here's an example of a traceroute report:
C:\>tracert -d www.cisco.com
Tracing route to www.cisco.com [198.133.219.25] over a maximum of 30 hops:
1 <10 ms <10 ms <10 ms 10.61.96.201
2 80 ms 80 ms 81 ms 144.254.221.45
3 80 ms 80 ms 80 ms 144.254.221.35
4 80 ms 80 ms 80 ms 144.254.220.57
5 251 ms 80 ms 80 ms 10.112.2.21
6 411 ms 180 ms 180 ms 10.112.2.25
7 231 ms 170 ms 180 ms 10.112.3.74
8 231 ms 170 ms 180 ms 10.112.3.1
9 240 ms 190 ms 190 ms 10.112.3.109
10 220 ms 190 ms 190 ms 10.112.3.117
11 210 ms 190 ms 190 ms 10.112.3.130
12 230 ms 210 ms 201 ms 10.112.3.114
13 200 ms 210 ms 201 ms 10.112.3.94
14 200 ms 200 ms 211 ms 10.112.3.105
15 240 ms 230 ms 231 ms 10.112.3.82
16 250 ms 241 ms 250 ms 10.112.3.65
17 230 ms 241 ms 230 ms 171.69.7.229
18 230 ms 241 ms 230 ms 171.69.7.174
19 231 ms 240 ms 240 ms 128.107.240.193
20 231 ms 240 ms 240 ms 128.107.239.106
21 230 ms 240 ms 591 ms 198.133.219.25
Trace complete.
A sophisticated tool for generating synthetic traffic is the Cisco IP SLA feature (described in more detail in Chapter 11, "IP SLA"). IP SLA is an active performance-monitoring agent embedded in Cisco IOS software. The agent measures performance by sending synthetic packets to a generic IP device or Cisco device. The packets are echoed to the sender, similar to the functionality of ping. IP SLA uses the time-stamp information to calculate performance metrics (such as jitter, latency, response time, and packet loss).
A target router that is running Cisco IOS software can act as an "IP SLA responder" that processes the IP SLA measurement packets and adds time-stamps. IP SLA can monitor per-class traffic in different traffic classes by setting the Differentiated Service Code Point (DSCP) bits. IP SLA operations can be scheduled to run once or continuously. To support proactive notification, thresholds are defined, and SNMP notifications are generated when these are exceeded. This feature can monitor the actual performance against defined SLAs by notifying the administrator of potential service-level violations. To expedite problem resolution, IP SLA can start an additional operation when a threshold is crossed, which allows for immediate real-time problem analysis. Measurement results can be retrieved with SNMP or from the Cisco IOS command-line interface (CLI).
Table 2-19 summarizes the different characteristics of the three active probing technologies described in this section.
When comparing active and passive measurement concepts, both have benefits and limitations, which leads to the question of how to position both in the best way. Passive measurement offers benefits for network monitoring in general, for application identification, and for troubleshooting, but it assumes that the traffic of interest is already present on the network. To maintain up-to-date statistics and trend reports about network performance, utilization, and the protocol and application mix on the network, you should apply passive measurement concepts. Active measurement extends this by proactively probing if the current performance metrics of the network and the services are within the defined range. As soon as service level agreements are deployed, you should implement proactive collection techniques and link them to a fault management system. As the network administrator, you need tools to identify and solve issues, such as slow service response times. Active monitoring helps identify the problem ideally even before the users call the help desk, but in most cases, it cannot point to the root cause of the issue ("Why is it slow?"). Passive collection helps you identify the root cause, because it meters the live traffic, from which conclusions can be drawn.
Take a situation in which a user calls the network operator and complains about slow network access. Active monitoring could have warned the operator that the RTT between a remote location and a server farm has increased, but it does not explain why it happened. By looking at live network traffic (passive monitoring), the operator found that a user who downloads large video files from the Internet was the cause of the delay. Now the operator can take appropriate action to solve the problem.
In the past, one-way delay (OWD) measurements were implemented as either simple active operations, such as Cisco IP SLA, or complex passive operations, such as the ART MIB. A new approach in the research community considers using packet collection technologies, such as NetFlow, for passive OWD calculation. The basic architecture requires two measurement instances, one on each side of the monitored network. Instead of aggregating packets into flows, the meter exports raw packets. The packet selection process is deterministic, and a set of classification rules is required:
Packets must not be modified in any way
Packet recognition must be based on existing packet fields and attributes
Select attributes that do not change across hops (such as IP source/destination address, port number, DSCP/TOS)
Generate an ID for each selected packet by using a hash function
Export the packet ID to the collector and apply a time stamp
Network Time Protocol (NTP) is a prerequisite
By implementing the concept of a unique ID for each selected packet, you also can identify a packet's path through the network and measure OWD on a per-hop basis. Compared to ART, this new approach does not require the network element to identify and measure transaction details. Instead, packets are selected based on different criteria, and the processing is offloaded to a collection station.
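The ID-based matching can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the referenced paper: packets are dictionaries, SHA-256 is an arbitrary choice of hash function, and the payload is included in the hash (an assumption made here) so that packets of the same flow receive different IDs. Time stamps are hypothetical millisecond values from NTP-synchronized meters.

```python
import hashlib

INVARIANT_FIELDS = ("src", "dst", "sport", "dport", "dscp")  # unchanged across hops

def packet_id(pkt: dict) -> str:
    """Hash hop-invariant header fields plus payload into a short ID."""
    header = "|".join(str(pkt[f]) for f in INVARIANT_FIELDS).encode()
    return hashlib.sha256(header + pkt["payload"]).hexdigest()[:16]

def one_way_delays(ingress_records, egress_records):
    """Match (id, time stamp) exports from two meters and subtract the times."""
    seen = dict(ingress_records)
    return {pid: ts - seen[pid] for pid, ts in egress_records if pid in seen}

pkt = {"src": "192.0.2.1", "dst": "198.51.100.7", "sport": 40000,
       "dport": 80, "dscp": 0, "ttl": 64, "payload": b"GET / HTTP/1.1"}
same_pkt_later = dict(pkt, ttl=62)   # TTL changed in transit, ID must not

ingress = [(packet_id(pkt), 1000)]
egress = [(packet_id(same_pkt_later), 1012)]
print(one_way_delays(ingress, egress))
```

Because the TTL is not part of the hash input, both meters derive the same ID for the packet, and the collector computes a one-way delay of 12 ms without the network elements tracking any transaction state.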
Note
For more information, refer to "Passive One-way Delay Measurements and Data Export" at http://www.fokus.gmd.de/research/cc/meteor/employees/carsten.schmoll/powd-netflow9.pdf.