CSCI 262 Data Structures

Fall 2018

Lab 4: Sorting

(Back to Labs)

Goals


Overview


This lab will walk you through some exercises using the standard library sort function. As usual, we ask that you work with a partner or a team. If you end up solo, please raise your hand and we'll pair you with someone else (or make a three-person team if necessary). You are welcome to change partners throughout the semester!


Background Info


As we discussed in class, sorting is a first step in many algorithms. Some of the problems we'll explore below, such as finding the mode of a collection, are quite expensive to solve without sorting, but become linear or cheaper once a collection is sorted. (Another example is binary search, which we know becomes an O(log n) problem after sorting. While sorting has a cost of O(n log n), if we search the collection many (>n) times, sorting pays off.)

The standard library contains multiple sort functions for different applications. We'll just focus on the basic sort function, which has O(n log n) worst-case performance and uses no additional memory (in a big-O sense) over the collection being sorted.

The sort function is actually a template function, which is something we'll study later in the semester. For now what is important is that sort knows how to sort collections of different types without any special work. For this assignment, we'll be sorting vectors of integers, vectors of strings, and strings.

As we saw in class, to sort a vector named vec, we only need to #include <algorithm>, and then simply:

    sort(vec.begin(), vec.end());
    

The begin and end methods return special objects called iterators, which we'll talk about more later in the semester.


Problem 1: Selection & Median


The selection problem asks, given a collection of n values, what is the kth smallest value in the collection? The most important selection problem is finding the median, or the (n/2)th element. You can probably see how sorting makes this problem very easy: simply sort the values in the collection, then return the kth index! (It turns out that there is a very nifty algorithm for selection which is O(n), but it is very complicated to explain; hopefully you will study it in a future class. You can read a somewhat dense explanation of the algorithm on Wikipedia.)

To explore this particular problem, you will work with the file words.txt, which contains 50,000 unique words chosen randomly from the Scrabble dictionary we used in a previous lab. For this task, you need to read in all the words from words.txt and then tell us the 8489th lexicographically smallest, 38719th lexicographically smallest, and 2796th lexicographically smallest words in the file. (A word is lexicographically smaller than another if it sorts less than the other word - lexicographical sorting is like alphabetical, except it uses ASCII codes rather than the alphabet. In our setting, we only have lower-case letters, so it is the same as alphabetical.) Enter your answer in the Canvas quiz for this lab. Don't forget about zero-based indexing!

You can write all the code yourself if you want the practice, but since we are focusing on sorting and not file I/O for this lab, here's some starter code you are welcome to paste into main.cpp in a new project:

/*
 * solution1.cpp
 *
 * Starter code for solving Problem 1: Selection & Median
 * Lab 4 - Sorting
 * CSCI 262 Data Structures
 *
 */

#include <string>
#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

int main() {
        vector<string> vec;

        ifstream fin("words.txt");
        if (!fin) {
                cerr << "File 'words.txt' not found. Exiting." << endl;
                return -1;
        }

        for (int i = 0; i < 50000; i++) {
                string word;
                fin >> word;

                if (!fin) {
                        cerr << "Error reading words from 'words.txt'.  Exiting." << endl;
                        return -1;
                }

                vec.push_back(word);
        }
        fin.close();

        // print 8489th, 38719th, and 2796th words from vec

        return 0;
}

Note that the words.txt file must be in the directory where your program is running from. In CLion by default, that is the "cmake-build-debug" directory.


Problem 2: Mode


The mode of a collection is the value that appears the most times in the collection. For example, given the collection of words {"apple", "orange", "orange", "pear", "apple", "orange"}, "orange" is the mode since it appears three times in the collection, more than any other word.

Sorting makes this problem easier by putting equivalent values adjacent to each other. You can then find the mode in linear time by looping over the values and counting adjacent "runs" of values (sequences of the same value). Find the mode of the values in the file numbers.txt, which contains 50,000 integers. (Note that multiple modes are possible, but we've fixed it so there is only one mode in this instance.) Report both the mode and how many times it appears.

As for the previous problem, here's some starter code.

/*
 * solution2.cpp
 *
 * Starter code for solving Problem 2: Mode
 * Lab 4 - Sorting
 * CSCI 262 Data Structures
 *
 */

#include <iostream>
#include <fstream>
#include <vector>

using namespace std;

int main() {
	vector<int> vec;

	ifstream fin("numbers.txt");
	if (!fin) {
		cerr << "File 'numbers.txt' not found. Exiting." << endl;
		return -1;
	}

	for (int i = 0; i < 50000; i++) {
		int number;
		fin >> number;

		if (!fin) {
			cerr << "Error reading number from 'numbers.txt'.  Exiting." << endl;
			return -1;
		}

		vec.push_back(number);
	}
	fin.close();

	// print the mode of the values in numbers.txt

	return 0;
}

Problem 3: Detecting anagrams


How can you tell if two strings are anagrams of each other? One way is to simply count the number of times each letter appears in each string, then compare counts. A less tedious way, though, is to simply sort the values in each string and then compare. If the original strings were anagrams of each other, the sorted strings will be the same!

Sorting a string is the same as sorting a vector. Given string s, just do:

    sort(s.begin(), s.end());

For this problem, we're giving you a vector of strings, all of the same length. Find the string that is an anagram of "computer science rocks", and report it on the Canvas quiz! Here's your starter code with the problem embedded:

/*
 * solution3.cpp
 *
 * Starter code for solving Problem 3: Detecting anagrams
 * Lab 4 - Sorting
 * CSCI 262 Data Structures
 *
 */

#include <string>
#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

int main() {
        vector<string> possibles = {
                "poor musket conscience",
                "cricketer unstop mucus",
                "concupiscent elks roam",
                "eccentric coke possums",
                "crescent pocket curium",
                "mockup curt reticences",
                "poke concentric mouses",
                "skip comets occurrence",
                "concrete cupric musket",
                "circumspect coke noses",
                "sum concoction keepers",
                "succinct composer ekes",
                "mercuric petcock tunes",
                "muskie concocts crepes",
                "smoker conscience pout",
                "cement truck occupiers",
                "eccentrics truck poems"
        };

        string match = "computer science rocks";

        // print the anagram of "computer science rocks"

        return 0;
}

Submission Instructions:


Answer the questions for the lab 4 quiz on Canvas.