Assignment 3 (20 marks)
Due: March 16, 2010 (3:30pm)
The purpose of this assignment is to gain experience with experimental methods used in computer systems performance evaluation.
Please do any one of the following three questions. Each question is worth the same number of marks, although the questions may differ in difficulty.
Q1. Web Response Time Measurements (20 marks)
There are many choices available for Internet access, from Gigabit Ethernet in the workplace to WiFi access at Starbucks to dialup modems at your summer cottage. Your goal in this question is to do an empirical measurement study of the user-perceived Web response time for a small set of (2 or 3) different Internet access technologies of your own choosing. For example, you might choose your desktop environment in the department, wireless access at the U of C, residential Internet access via your ISP, or maybe even your Internet-enabled cell phone. You will be downloading a simple set of Web objects of different sizes, and comparing the user-perceived performance observed.
- (6 marks) Choose your first network environment. Record the Web browser, operating system, TCP implementation, and network access technology being used, as well as the date, time, and duration of your experiments. You might also want to record results from ping and/or traceroute regarding network round-trip time and Internet path being used. Download the simple Web pages of different sizes, recording the user-perceived response time for each. For example, files of sizes 1 KB, 2 KB, 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, 512 KB, and 1 MB are available on the course Web site. Repeat your experiments if you wish to increase the statistical confidence in your results. Show your results in a graph or table (or both). Analyze and discuss your results, focusing particularly on the relationship (if any) between file size and user-perceived response time.
- (6 marks) Choose a second (and different) network environment. Again, record all the appropriate meta-data for your experiment. Download the simple Web pages of different sizes, recording the user-perceived response time for each, and repeating your experiments if desired. Show your results in a graph or table (or both). Analyze and discuss your results, as done in the previous experiment.
- (8 marks) Compare and contrast your results from the two environments. Highlight your observations. Comment on differences observed, if any. Explain your results as best you can.
Bonus (up to 4 marks)
Collect similar measurements from a third different network environment. Compare and contrast your results with the previous ones.
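As a starting point for the measurements described above, here is a minimal sketch of a scripted download timer in Python. It is only one possible approach, not a required method. The function names (fetch_url, time_downloads, summarize, measure) are illustrative, and the base URL and file names you pass in are placeholders to be replaced with the actual locations on the course Web site. Note that a script times only the HTTP transfer, so it is a proxy for user-perceived response time and excludes browser rendering.

```python
import statistics
import time
import urllib.request


def fetch_url(url):
    """Download one object completely and return the number of bytes received."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return len(response.read())


def time_downloads(fetch, repeats=5):
    """Call fetch() `repeats` times and return the elapsed times in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return (mean, sample standard deviation) of the timing samples."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev


def measure(base_url, names, repeats=5):
    """Time each named object under base_url and print one summary line per object.

    base_url and names are placeholders supplied by the caller; they are
    assumptions of this sketch, not the real course file names.
    """
    for name in names:
        samples = time_downloads(lambda: fetch_url(base_url + name), repeats)
        mean, stdev = summarize(samples)
        print(f"{name}: mean {mean:.3f} s, stdev {stdev:.3f} s over {repeats} runs")
```

Calling measure() with your own base URL and list of file names in each network environment gives one row per file size. Repeating the downloads lets you report the variability along with the mean.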
Q2. File System Workload Characterization (20 marks)
The data file sample.txt contains a small sample of a much larger data file files.txt (6 MB uncompressed text file, available upon request) that contains the output of the Unix command "ls -lR" in my home directory on the CPSC file servers in January 2010. The output shows information such as the name of each file and directory, the file permissions, the file size, the file modification date, and so on.
You will use this empirical data file in a workload characterization study of the Unix file system (albeit only for 1 user). Using data analysis tools of your own choosing (e.g., grep, awk, perl, gnuplot, Excel, MATLAB, C, C++, Java, Python), process this empirical data set to answer as many of the following questions as you can.
- (2 marks) How many different files and directories are there? What is the aggregate size of these files (in bytes)?
- (2 marks) What is the largest file? How big is it?
- (4 marks) What is the mean file size? What is the standard deviation of file size? What is the median file size (50th percentile value)? What is the mode (most frequently occurring value) of the file size distribution?
- (4 marks) Plot a graph showing the file size distribution, using a Cumulative Distribution Function (CDF). Use a graph style and axis scaling (e.g., linear, logarithmic, log-linear, log-log) of your own choosing to convey the distribution effectively. Comment on your observations.
- (4 marks) With some clever programming effort, you should be able to calculate (or estimate) the age of each file (i.e., the number of days since it was last modified). What is the oldest file? How old is it? What is the newest file? How old is it? What are the mean, median, and mode for the file age distribution?
- (4 marks) Plot a CDF graph showing the file age distribution. Use a graph style and axis scaling of your own choosing to convey the distribution effectively. Comment on your observations.
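The questions above all start from parsing the "ls -lR" output, so here is a minimal sketch of one way to do that in Python. It assumes the common nine-field "ls -l" layout (permissions, links, owner, group, size, month, day, time-or-year, name), which you should check against sample.txt. The listing date passed as listed_on is also an assumption you must supply, since the data is only known to be from January 2010. All function names are illustrative.

```python
import statistics
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def parse_line(line, listed_on):
    """Parse one `ls -l` entry into (is_dir, name, size, modified), or None.

    Returns None for anything that is not a regular file or directory
    entry: blank lines, "total N" lines, "dir:" headers, symlinks.
    """
    fields = line.split(None, 8)
    if len(fields) < 9 or fields[0][0] not in "-d" or fields[5] not in MONTHS:
        return None
    try:
        size, day = int(fields[4]), int(fields[6])
    except ValueError:
        return None
    month = MONTHS[fields[5]]
    if ":" in fields[7]:
        # Recent entries show HH:MM instead of a year: take the latest
        # matching date that is not after the listing date.
        modified = None
        for year in (listed_on.year, listed_on.year - 1):
            try:
                candidate = date(year, month, day)
            except ValueError:  # e.g. Feb 29 in a non-leap year
                continue
            if candidate <= listed_on:
                modified = candidate
                break
        if modified is None:
            return None
    else:
        modified = date(int(fields[7]), month, day)
    return fields[0][0] == "d", fields[8], size, modified


def summarize(values):
    """Return mean, population standard deviation, median, and mode."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


def empirical_cdf(values):
    """Return sorted (value, fraction of observations <= value) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    cdf = {}
    for i, v in enumerate(ordered, start=1):
        cdf[v] = i / n  # later duplicates overwrite with the larger fraction
    return list(cdf.items())
```

File ages in days follow as (listed_on - modified).days, and the same summarize() and empirical_cdf() helpers apply to both sizes and ages. The sketch uses the population standard deviation because the listing covers every file rather than a sample; use statistics.stdev if you prefer the sample form. When several values tie for the mode, statistics.mode returns the first one encountered in Python 3.8 and later.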
Bonus (up to 4 marks)
With a bit of effort, you should be able to analyze the file type distribution. On a Unix system, file types can be determined heuristically based on the (optional) suffix in the file name (e.g., foo.html, paper127.pdf, painful.doc). Produce a table showing the top 10 identifiable file types in the data, in sorted order from most prevalent to least prevalent. Within this table, show the number of files of each type, the percentage of files of each type, the number of bytes for each file type, and the percentage of bytes for each file type. If necessary, use a catch-all category "Unknown" for any file types that are not easily discernible from the file name suffix. In the table, add a category "Other" for those files not accounted for among the top 10 file types, so that the percentages in the table sum properly to 100%. Comment on your observations.
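The table described above could be built along the following lines. This is a minimal sketch, assuming you already have a list of (file name, size in bytes) pairs from your own parsing step. The function name file_type_table and its row layout are illustrative choices, and the suffix is taken to be whatever os.path.splitext returns, lowercased.

```python
import os
from collections import defaultdict


def file_type_table(files, top_n=10):
    """Summarize (name, size) pairs by file-name suffix.

    Returns rows of (type, file count, % of files, bytes, % of bytes):
    the top_n identifiable suffixes from most to least prevalent, then
    "Unknown" (no suffix) and "Other" (all remaining suffixes) if non-empty.
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for name, size in files:
        suffix = os.path.splitext(name)[1].lower()
        kind = suffix if suffix else "Unknown"
        counts[kind] += 1
        sizes[kind] += size

    total_files = sum(counts.values())
    total_bytes = sum(sizes.values())

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    # Rank identifiable suffixes by file count, breaking ties alphabetically.
    known = sorted((k for k in counts if k != "Unknown"),
                   key=lambda k: (-counts[k], k))
    top, rest = known[:top_n], known[top_n:]

    rows = [(k, counts[k], sizes[k]) for k in top]
    if "Unknown" in counts:
        rows.append(("Unknown", counts["Unknown"], sizes["Unknown"]))
    if rest:
        rows.append(("Other", sum(counts[k] for k in rest),
                     sum(sizes[k] for k in rest)))
    return [(k, c, pct(c, total_files), b, pct(b, total_bytes))
            for k, c, b in rows]
```

Ranking here is by number of files; ranking by bytes instead is an equally reasonable reading of "most prevalent" and needs only a change to the sort key. Note that os.path.splitext treats a name such as .bashrc as having no suffix and reports only the final suffix of a name such as archive.tar.gz.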
Q3. Wireless Network Data Analysis (20 marks)
Choose any interesting data set from the CRAWDAD (Community Resource for Archiving Wireless Data at Dartmouth) Web site and analyze it.
Bonus (up to 4 marks)
Compare and contrast your wireless data analysis results with those from another environment, such as the U of C network.
Submitting Your Assignment
When you are finished, hand in a hardcopy version of your solution to your instructor, either in person or under his office door. Provide proper citation for any literature or Internet sources used. Submissions must be received on or before the stated submission deadline; otherwise, a late penalty of 10% (2 marks) per day will apply.