CPSC 641: Performance Issues in High Speed Networks

Professor Carey Williamson

Winter 2019

Assignment 1: Empirical Data Analysis (30 marks)

Due Date: Thursday, January 31, 2019 (4:00pm)

The purpose of this assignment is to gain experience with data analysis, statistical methods, graph plotting, and interpretation of results. You will analyze several empirical datasets, applying your data analysis skills to explore and understand some of the structural properties of the data.

Background

One assumption that is often made in analytical modeling work is that events in a computer system (e.g., user requests, CPU jobs, file I/O, network packets, disk failures, Skype calls) occur according to a "Poisson arrival process". Informally, this means that events occur "randomly" (i.e., at random time instants that are impossible to predict, even if the average arrival rate is known). More formally, it means that the inter-arrival times between events are exponentially distributed and independent. If this is the case, then the counts of the number of events that occur within any chosen fixed-size time interval (e.g., 1 minute, 1 hour, 1 day) should follow the Poisson distribution (i.e., a discrete distribution for which the mean and variance are equal; the histogram for such a distribution usually has a nice humpy shape with a pronounced tail on the right).
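The link between exponential inter-arrival times and Poisson counts can be checked with a small simulation. The following is only a sketch under stated assumptions: the function name `simulate_counts`, the rate of 2 events per time unit, the window size of 5 time units, and the seed are all illustrative choices, not part of the assignment. It generates independent exponential gaps, counts events per fixed-size window, and compares the mean and variance of those counts, which should be roughly equal.

```python
import random
import statistics

def simulate_counts(rate, window, n_windows, seed=641):
    """Count events per fixed-size window when gaps are i.i.d. exponential.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    counts = [0] * n_windows
    horizon = window * n_windows
    t = rng.expovariate(rate)              # time of the first event
    while t < horizon:
        index = min(int(t // window), n_windows - 1)
        counts[index] += 1                 # bin the event into its window
        t += rng.expovariate(rate)         # independent exponential gap
    return counts

counts = simulate_counts(rate=2.0, window=5.0, n_windows=20000)
mean_count = statistics.fmean(counts)
var_count = statistics.pvariance(counts)
print(f"mean count = {mean_count:.3f}, variance = {var_count:.3f}")
```

With these settings the expected count per window is rate times window, i.e., 10, so both printed values should land close to 10. For an empirical dataset, a variance-to-mean ratio far from 1 is one quick hint that the counts are not Poisson.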

In a Poisson arrival process, there is a well-defined average rate for the arrivals, such as K events per time unit. As a result, the mean inter-arrival time is 1/K time units between events. Furthermore, the distribution of the inter-arrival times is exponential. The exponential distribution is the only continuous distribution with the "memoryless" property, which makes it easier to analyze mathematically. Recall that an exponential distribution has only positive values (i.e., strictly greater than zero), with no upper limit (i.e., potentially infinite). Nonetheless, the histogram of such a distribution typically shows a lot of small values at or below the mean, and a gracefully declining probability (i.e., exponential decay) of observing values much larger than the mean. In particular, the Coefficient of Variation (CoV, which is the ratio of the standard deviation to the mean) for the exponential distribution is exactly 1.
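For reference, the properties quoted above follow directly from the exponential density with rate K. The derivation below is a standard one, written with X as an (assumed) name for a single inter-arrival time; it shows why the mean is 1/K, why the CoV is exactly 1, and what "memoryless" means.

```latex
\begin{align*}
f(x) &= K e^{-Kx}, \qquad x > 0 \\
E[X] &= \int_0^\infty x \, K e^{-Kx} \, dx = \frac{1}{K} \\
\mathrm{Var}[X] &= E[X^2] - (E[X])^2 = \frac{2}{K^2} - \frac{1}{K^2} = \frac{1}{K^2} \\
\mathrm{CoV} &= \frac{\sqrt{\mathrm{Var}[X]}}{E[X]} = \frac{1/K}{1/K} = 1 \\
P(X > s + t \mid X > s) &= \frac{e^{-K(s+t)}}{e^{-Ks}} = e^{-Kt} = P(X > t)
\end{align*}
```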

In experimental performance evaluation work, people often use empirical datasets that record the actual event arrival times in real systems. One important skill is knowing whether empirical data is consistent with a Poisson arrival process or not. If the arrival process is not Poisson, then the performance of the system (e.g., queueing, loss, response time, throughput) could be quite different from that predicted by an analytical model (i.e., better or worse, depending on the CoV and the correlation structure, if any).

Your Task

Your task in this assignment is to analyze some empirical datasets and determine which (if any) are consistent with a Poisson arrival process. Note that checking for this property involves two separate tasks: (1) checking if the inter-arrival times are exponential; and (2) checking if the inter-arrival times are independent. The first task (exponentiality) can be done in a variety of ways (e.g., statistical methods, graphical methods, goodness-of-fit tests, QQ plots, Anderson-Darling, KS test, Chi-Square test, etc.), as you deem appropriate. The second task (independence) can also be done in several ways (e.g., statistical, graphical, autocorrelation, etc.), but technically isn't required if the exponentiality test has already failed. (See Appendix A of the 1994 ACM SIGCOMM paper by Paxson and Floyd on "The Failure of Poisson Modeling" for a detailed discussion of these tests.)
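As one possible starting point for the two checks, the sketch below computes a KS distance against an exponential CDF and a lag-1 autocorrelation, using only the Python standard library. The function names, the synthetic test data, and the seed are illustrative assumptions; this is one of many acceptable approaches and not a prescribed method. Note that the exponential mean is estimated from the same data, so the usual 5% KS threshold of about 1.36/sqrt(n), which assumes a fully specified distribution, is conservative here.

```python
import math
import random
import statistics

def ks_statistic_exponential(iats):
    """KS distance between the empirical CDF of the inter-arrival times
    and an exponential CDF whose mean is estimated from the sample."""
    xs = sorted(iats)
    n = len(xs)
    mean = statistics.fmean(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-x / mean)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def autocorrelation(iats, lag=1):
    """Sample autocorrelation of the inter-arrival times at the given lag."""
    n = len(iats)
    mean = statistics.fmean(iats)
    denom = sum((x - mean) ** 2 for x in iats)
    num = sum((iats[i] - mean) * (iats[i + lag] - mean) for i in range(n - lag))
    return num / denom

# Synthetic data for illustration only (not one of the assignment datasets).
rng = random.Random(641)
exp_iats = [rng.expovariate(0.5) for _ in range(5000)]   # exponential gaps
uni_iats = [rng.uniform(0.0, 4.0) for _ in range(5000)]  # clearly not exponential

d_exp = ks_statistic_exponential(exp_iats)
d_uni = ks_statistic_exponential(uni_iats)
r1_exp = autocorrelation(exp_iats, lag=1)
band = 1.96 / math.sqrt(len(exp_iats))   # rough 95% band for independent data
print(f"KS: exponential={d_exp:.4f}, uniform={d_uni:.4f}")
print(f"lag-1 autocorrelation={r1_exp:.4f}, band=+/-{band:.4f}")
```

A small KS distance together with autocorrelations that mostly stay inside the band is consistent with a Poisson arrival process; a large KS distance, as for the uniform sample here, already rules it out.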

Please do any six (but not all!) of the following ten empirical datasets:

  1. nest: days on which my NEST thermostat reset itself (N=35)
  2. haircuts: days on which I got my hair cut (N=54)
  3. whistles: the times at which a referee blew his whistle during a hockey game (N=68)
  4. paydays: days on which the U of C issued paychecks into my account (N=96)
  5. goals: times at which the Calgary Flames scored goals (so far!) this season (N=190)
  6. papers: the times when an ACM journal received paper submissions (N=393)
  7. emails: the times at which I received emails about the IWQoS 2018 conference (N=924)
  8. logouts: the times when students logged out of D2L on the evening of March 1, 2017 (N=1,192)
  9. cars: the times at which (simulated) cars arrive to the Banff park entrance (N=10,000)
  10. packets: the timestamps of packets (frames) seen on an Ethernet LAN (N=1,000,000)

Note that these datasets are all different sizes, and in different formats, just to give you some extra practice in your data analysis skills. Also note that some datasets (intentionally) have some imperfections remaining, so please watch out for these, and find a reasonable way to handle them. If you have any questions about the data formats, let me know. Enjoy!

Data Analysis Tasks

For each of your chosen datasets, do the following analysis steps to help answer the questions indicated:

Produce a table to summarize your results. Use one row for each dataset, and use the columns to summarize the main features of each dataset (e.g., number of data points, duration, minimum/median/maximum inter-arrival time (IAT), mean and standard deviation of the IAT, CoV, exponentiality, independence, and whether it is a Poisson arrival process or not). See below for a crude example of a suggested table format.

File  NumObs  Duration  Min    Median  Max    Mean  StdDev  CoV   Exponential?  Independent?  Poisson?
foo1  120     3.2 hrs   2.6    8.0     75.4   20.4  12.5    0.6   No            N/A           No
foo2  500     7.1 yrs   106    819     6475   436   450     1.03  Yes           Yes           Yes
foo3  1,200   60 min    0.002  0.032   0.124  0.05  0.05    1.0   Yes           No            No
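The numeric columns of such a table can be computed from a list of event timestamps along the following lines. This is a minimal sketch, assuming the timestamps have already been parsed into numbers in a single time unit; the function name `summarize` and the toy input are hypothetical. It sorts the timestamps, takes successive differences as the inter-arrival times, and uses the sample standard deviation (dividing by n - 1), which is a choice you may reasonably make differently.

```python
import statistics

def summarize(timestamps):
    """Summary statistics of the inter-arrival times for numeric timestamps."""
    ts = sorted(timestamps)                       # guard against out-of-order records
    iats = [b - a for a, b in zip(ts, ts[1:])]    # N timestamps give N - 1 gaps
    mean = statistics.fmean(iats)
    stdev = statistics.stdev(iats)                # sample standard deviation
    return {
        "NumObs": len(ts),
        "Duration": ts[-1] - ts[0],
        "Min": min(iats),
        "Median": statistics.median(iats),
        "Max": max(iats),
        "Mean": mean,
        "StdDev": stdev,
        "CoV": stdev / mean,
    }

# Toy input for illustration only.
row = summarize([0.0, 1.0, 3.0, 6.0, 10.0])
for name, value in row.items():
    print(f"{name}: {value}")
```

The Exponential?, Independent?, and Poisson? columns are judgments rather than formulas, so they still need to be filled in from your own tests and plots.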

Optional Bonus (2 marks)

Augment your results table with one additional empirical dataset of your own personal choice (i.e., not from my list of datasets above). Make sure that it has at least 100 data points, but not more than 10,000. Say what the dataset is, and how it was collected. Then complete your results table with your observations about this empirical dataset. State whether it follows a Poisson arrival process or not, and give some (brief) logical explanation as to why or why not.

Assignment Submission

When you are finished, please submit your assignment solution in hardcopy form to your instructor, on or before the stated deadline. Please include your summary table showing results for all six datasets that you chose, and any relevant parts of your writeup. However, you only need to include the pdf/CDF graphs for two of your six datasets, with one of them being a good example of Poisson arrivals, and one not. Thus you should make sure that there is at least one example of each type among the six datasets that you choose for analysis. If you do the bonus, please include those pdf/CDF graphs as well. Thanks!