CSCI 262 Data Structures

Fall 2017

Lab 4: Sets and Maps

(Back to Labs)

Goals


Overview


This lab will walk you through some exercises using Sets and Maps. As usual, we ask that you work with a partner or a team. If you end up solo, please raise your hand and we'll pair you with someone else (or make a three-person team if necessary). You are welcome to change partners throughout the semester!


Part 1 Instructions


Task 1.0: preliminaries

Download dictionary.txt, which contains a large body of strings we'll use for testing.

Next, download lab04.cpp and lab04b.cpp, the starter code files for this project. Go ahead and create a project in Cloud9 as usual, and upload the starter code and dictionary.txt to it.


Task 1.1: figure out what is going on

Read through the lab04.cpp code provided. Much of the code is stuff you have seen before or practiced in a previous lab - opening a file, reading in its contents, or adding items to a vector. Some of it may be less familiar, but the comments should help.

In brief, the code is doing the following:

  1. Reading the words from dictionary.txt into a vector
  2. Creating a subset of the vector (of size TEST_SIZE) to use as a test collection
  3. Timing how long it takes to search the vector for all the words in the test collection

Pay close attention to the timing code, as you will need this again, and it is generally useful to have for your own performance optimization of your code in the future. If you want to see how long an operation will take coded one way versus another, wrap some timing code around it and find out. The only catch is, computers are so fast that you won't get very helpful numbers unless you perform the operation many times - which is what the TEST_SIZE parameter in our program is for.

The clock() function is the critical piece of the timing code. It returns a value of type clock_t, which is a number counting the "ticks" of some architecture/operating system-dependent clock since the start of the program. You get the clock value at the start of what you want to time, then again at the end, and subtract to get the number of clock ticks. For human eyes, it is generally helpful to convert to standard seconds - you do this by dividing by CLOCKS_PER_SEC. CLOCKS_PER_SEC and the clock() function are both provided in the <ctime> library.

Compile and run the starter code:

    g++ -o lab04 lab04.cpp
    ./lab04

Calibrate the TEST_SIZE constant at the top of lab04.cpp. You want the code to take long enough to see that something is actually happening, but you don't want to spend all your time waiting. If the timed part of the program is in the 5 - 15 seconds range, that's about what you want. Adjust TEST_SIZE downward to reduce the time, adjust it upwards to increase the time (recompile each time).


Task 1.2: time using a vector as if it were a set

In the lecture on Sets and Maps, we described the Set as an ADT having three principal operations: adding, removing, and testing for containment. For the first part of our experiment today, we are going to see how well a vector does at one critical operation: testing for containment. The provided code does just that, by filling a vector with (unique) words and repeatedly looking for the word in the vector.

The way you look for a string in an arbitrary vector of strings is: start at the beginning, comparing words as you go - as soon as you find the word, stop and return true. If you get to the end without finding the word, return false. There are 127,141 words in dictionary.txt, but looking through them one at a time is still very fast - which is why we have to test using so many words! The provided code uses the <algorithm> library function find() to search the vector in the fashion described, but you could just as well loop on the size of the vector with an index, or even do a range-based for loop (although find() is generally faster).

Run the provided code three times to collect three different timings; they shouldn't vary by much. Record the timings for entry into Canvas!


Task 1.3: time using a set as a set

Rewrite the provided code to now perform the same tests using a set instead of a vector (be sure to use the count() or find() methods of set to test for the existence of a value in the set). Make sure you leave TEST_SIZE unchanged, so you can compare fairly against your vector tests. Record timings for three trials using the set.

Note that the code to set up the set, including declaring the set and adding all of the words in dictionary.txt to it, has already been written. All you need to do is change the code that is being timed. Be sure to recompile, and you might be surprised by the timing difference this change makes!


Task 1.4: apply compiler optimization

Modern compilers are wonderfully complex, and include various optimization routines to speed code up. These can have significant impact on code performance, with only a slight increase in compile time. To see the effect of optimization on code performance, try compiling with optimizations turned on using the -O3 flag for g++, and rerun your timings (you don't need to record these):

    g++ -O3 -o lab04 lab04.cpp

For further exploration and extra credit, choose one or more of the questions below, and provide a paragraph or so answering the question in Canvas.

  1. What difference does it make in the timings if you use a test set made up of randomly generated strings (say, of length 4), rather than words you know are in the dictionary? Why might this make a difference?
  2. How long does it take to fill the set with words from dictionary.txt versus filling the vector with the same words?
  3. How long does it takes to remove all of the test words from the vector versus removing them from the set?

Part 2 Instructions


For part 2, you are going to explore using a map to return (randomly selected) words of a length chosen by the user. Some starter code (lab04b.cpp) has been written for you, with a small number of words simply hard-coded into the program. You will be improving this program to be more robust, and to include all of the words from dictionary.txt.

Task 2.1: compile and run

Compile and run the program:

    g++ -o lab04b lab04b.cpp
    ./lab04b

Now try out a few word lengths between 1 and 4. The program should offer you a randomly selected word from its lists of words of that length. Now try a length greater than 4 and record what happens.


Task 2.2: improve the program

There are two major issues with the above program, and your job is to fix them. Your final solution will be turned in as part of your lab 4 submission. Here are the two problems:

When fixing the second problem, there are a couple of approaches you can take; one is significantly faster than the other. Remember in class we talked about how the [] operator returns a reference to the value stored in the map, rather than a copy? Try expanding the vectors in-place using [] rather than getting and putting the vector values.


Submission Instructions:


Answer the questions for the lab 4 quiz on Canvas. You will need to paste in your improved Part 2 code for one of the answers.