CSCI 261 - Programming Concepts - Fall 2021

A6 - Green Eggs and Ham

→This assignment is due by Tuesday, November 09, 2021, 11:59 PM.←
→ As with all assignments, this must be an individual effort and cannot be pair programmed. Any debugging assistance must follow the course collaboration policy and be cited in the comment header block for the assignment.←
→ Do not forget to complete the following labs with this set: L6A, L6B, L6C


In this homework, we will focus on arrays, vectors, strings, File I/O, and Functions!


Overview


Have you ever finished a book and wondered, "Geez, I wonder how many times each word occurs in this text?" No? This assignment illustrates a fundamental use of the array & vector: storing related values in a single data structure, and then using that data structure to reveal interesting facts about the data.

For this assignment, you will read in a text file containing the story Green Eggs and Ham. You will then need to count the number of occurrences of each word & letter and display the frequencies. You'll be amazed at the results!


The Specifics


For this assignment, download the A6 code pack. This zip file contains several files, including main.cpp and two public input files.

The contents of main.cpp are shown below:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

#include "functions.h"

int main() {
    // get filename to open
    string filename = promptUserForFilename();

    // open file for parsing
    ifstream fileIn;
    if( !openFile(fileIn, filename) ) {
        cerr << "Could not open file \"" << filename << "\"" << endl;
        cerr << "Shutting down" << endl;
        return -1;
    }

    // read all the words in the file
    vector<string> allWords = readWordsFromFile(fileIn);
    fileIn.close();
    cout << "Read in " << allWords.size() << " words" << endl;

    //*************************************************************************************
    // word processing

    // clean the words to remove punctuation and convert to uppercase
    removePunctuation(allWords, "?!.,;:\"()_");
    capitalizeWords(allWords);

    // find only the unique words in the file
    vector<string> uniqueWords = filterUniqueWords(allWords);
    cout << "Encountered " << uniqueWords.size() << " unique words" << endl;

    // count the number of occurrences of each word
    vector<unsigned int> uniqueWordCounts = countUniqueWords(allWords, uniqueWords);

    // sort the words by count
    sortWordsByCounts(uniqueWords, uniqueWordCounts);

    // pretty print the unique words and their corresponding counts
    printWordsAndCounts(uniqueWords, uniqueWordCounts);

    //*************************************************************************************
    // letter processing

    // count the occurrences of every letter in the entire text
    unsigned int letterCounts[26] = {0};
    countLetters(allWords, letterCounts);
    printLetterCounts(letterCounts);

    // print statistics on letter frequencies
    printMaxMinLetter(letterCounts);

    return 0;
}

Take note how the program now reads as a series of subtasks and the provided comments are redundant. The code is "self documenting" with the function names providing the steps that are occurring. Your task is to provide the implementations for all of the referenced functions. You will need to create two files: functions.h and functions.cpp to make the program work as intended.

You will want to make your program as general as possible by not having any assumptions about the data hardcoded in. Two public input files have been supplied with the starter pack. We will run your program against a third private input file.


Function Requirements


The requirements of each function are given below. The input, output, and task of each function is described. The functions are:

  1. promptUserForFilename()
  2. openFile()
  3. readWordsFromFile()
  4. removePunctuation()
  5. capitalizeWords()
  6. filterUniqueWords()
  7. countUniqueWords()
  8. sortWordsByCounts()
  9. printWordsAndCounts()
  10. countLetters()
  11. printLetterCounts()
  12. printMaxMinLetter()

promptUserForFilename()

Input: None

Output: A string

Task: Prompt the user to enter a filename.

openFile()

Input:

  1. The input file stream
  2. The string filename to open

Output: True if the file successfully opened, False if the file could not be opened

Task: Open the input file stream for the corresponding filename. Check that the file opened correctly. The string filename will remain unchanged.
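
One possible shape for this function is sketched below. The parameter types are an assumption, not a required signature: the stream is taken by reference so that the caller's stream object is the one that gets opened, and the filename is taken by constant reference so that it cannot be changed.

```cpp
#include <fstream>
#include <string>
using namespace std;

// Sketch only: attempts to open fileIn on filename and reports success.
// Passing the stream by reference means main() sees the opened stream.
bool openFile(ifstream& fileIn, const string& filename) {
    fileIn.open(filename);
    // fail() is true when the open did not succeed
    return !fileIn.fail();
}
```

With this sketch, the check in main.cpp prints its error message whenever the named file does not exist or cannot be read.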

readWordsFromFile()

Input: The input file stream

Output: A vector of strings

Task: Read all of the words that are in the filestream and return a list of all the words in the order present in the file.

removePunctuation()

Input:

  1. A vector of strings
  2. A string of all the punctuation characters to remove

Output: None

Task: For each word in the vector, remove all occurrences of all the punctuation characters denoted by the punctuation string. When complete, the input vector will now hold all the words with punctuation removed. The punctuation string will remain unchanged.
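
As a rough sketch of one approach, each word can be rebuilt from only those characters that do not appear in the punctuation string. The parameter types are assumptions: the vector is passed by reference so the caller's words are modified, and the punctuation string by constant reference so it stays unchanged.

```cpp
#include <string>
#include <vector>
using namespace std;

// Sketch only: strips every character found in punctuation from every word.
void removePunctuation(vector<string>& words, const string& punctuation) {
    for (string& word : words) {
        string cleaned;
        for (char c : word) {
            // string::find returns npos when c is not a punctuation character
            if (punctuation.find(c) == string::npos) {
                cleaned += c;
            }
        }
        word = cleaned;
    }
}
```

Characters that are not listed in the punctuation string, such as a hyphen in this assignment's call from main.cpp, are left in place.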

capitalizeWords()

Input: A vector of strings

Output: None

Task: For each word in the vector, convert each character to its upper case equivalent. When complete, the input vector will now hold all the words capitalized.
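
The conversion can be sketched with toupper from <cctype>, applied to each character in place. The by-reference parameter is an assumption made so that the caller's vector is the one that changes.

```cpp
#include <cctype>
#include <string>
#include <vector>
using namespace std;

// Sketch only: converts every character of every word to upper case in place.
void capitalizeWords(vector<string>& words) {
    for (string& word : words) {
        for (char& c : word) {
            // the cast to unsigned char avoids undefined behavior for negative char values
            c = static_cast<char>(toupper(static_cast<unsigned char>(c)));
        }
    }
}
```

Characters without an upper case equivalent, such as digits or hyphens, are returned unchanged by toupper.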

filterUniqueWords()

Input: A vector of strings

Output: A vector of strings

Task: The function will return only the unique words present in the input vector. The output vector will not contain any duplicate words.
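
One straightforward way to do this, shown as a sketch, is a linear search: a word is appended to the result only if it is not already there. This keeps the unique words in order of first appearance, which is an assumption about the intermediate order (the final order is decided later by sortWordsByCounts()).

```cpp
#include <string>
#include <vector>
using namespace std;

// Sketch only: returns each distinct word once, in order of first appearance.
vector<string> filterUniqueWords(const vector<string>& allWords) {
    vector<string> uniqueWords;
    for (const string& word : allWords) {
        bool alreadySeen = false;
        for (const string& known : uniqueWords) {
            if (known == word) {
                alreadySeen = true;
                break;
            }
        }
        if (!alreadySeen) {
            uniqueWords.push_back(word);
        }
    }
    return uniqueWords;
}
```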

countUniqueWords()

Input:

  1. A vector of strings representing all of the words in the file
  2. A vector of strings representing only the unique words in the file

Output: A vector of unsigned integers

Task: For every unique word in the list, count the number of times that word occurs in the full text. Return a vector of all the counts. Each position in the vector of counts corresponds to the same position in the unique word list, so the vector of counts will have the same size as the vector of unique words. Upon completion, neither input vector will be modified.
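
The parallel-vector idea can be sketched as follows: counts[i] holds the number of times uniqueWords[i] appears in allWords. Passing both vectors by constant reference is an assumption that enforces the "neither input vector will be modified" requirement.

```cpp
#include <string>
#include <vector>
using namespace std;

// Sketch only: counts[i] is the number of occurrences of uniqueWords[i] in allWords.
vector<unsigned int> countUniqueWords(const vector<string>& allWords,
                                      const vector<string>& uniqueWords) {
    vector<unsigned int> counts(uniqueWords.size(), 0);
    for (size_t i = 0; i < uniqueWords.size(); ++i) {
        for (const string& word : allWords) {
            if (word == uniqueWords[i]) {
                ++counts[i];
            }
        }
    }
    return counts;
}
```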

sortWordsByCounts()

Input:

  1. A vector of strings representing only the unique words in the file
  2. A vector of unsigned integers representing the counts for each unique word

Output: None

Task: Sort the strings and counts in the input vectors by counts from greatest to smallest. If two strings are present the same number of times, then sort those strings alphabetically. When complete, the words will be ordered from most frequent to least frequent, with equal occurrences in alphabetical order, and the counts will be ordered from greatest to smallest.

Refer to the expected output files for examples on the expected ordering.
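
The key point is that the two vectors must be rearranged together so that each count stays at the same position as its word. The sketch below uses a selection sort as one possible technique (any correct sort would do); the names best, moreFrequent, and tieButEarlier are illustrative only.

```cpp
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Sketch only: selection sort over two parallel vectors.
// Order: larger count first; equal counts are ordered alphabetically.
void sortWordsByCounts(vector<string>& words, vector<unsigned int>& counts) {
    for (size_t i = 0; i + 1 < words.size(); ++i) {
        size_t best = i;  // index of the entry that belongs at position i
        for (size_t j = i + 1; j < words.size(); ++j) {
            bool moreFrequent = counts[j] > counts[best];
            bool tieButEarlier = counts[j] == counts[best] && words[j] < words[best];
            if (moreFrequent || tieButEarlier) {
                best = j;
            }
        }
        // swap both vectors so word and count stay paired
        swap(words[i], words[best]);
        swap(counts[i], counts[best]);
    }
}
```

Because the words were already converted to upper case, comparing strings with < gives plain alphabetical order.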

printWordsAndCounts()

Input:

  1. A vector of strings
  2. A vector of unsigned integers

Output: None

Task: For each word and count in the vectors, print out the word and its corresponding count. Upon completion, the two vectors will remain unchanged. Format the output as follows:

#P : ABCDEF : #C

Notice how there are three columns separated by :. We want the : aligned in every row and the values aligned in each column. The columns correspond to the following values:

  1. #P - The position of the word in the list. Begin at 1. Right align all values. Allocate enough space for the length of the last position. (If there are fewer than 10 words, then we need only 1 space. If there are fewer than 100 words, then we need only 2 spaces. And so on. Assume there will be at most 10^9 unique words.)
  2. ABCDEF - The unique word. Left align all values. Allocate enough space for the longest word present in the list.
  3. #C - The corresponding count of the unique word. Right align all values. Allocate enough space for the length of the largest number. (Assume there will be at most 10^9 unique words.)

An example with actual values is shown below:

1 : BIRTHDAY : 4
2 : HAPPY    : 4
3 : TO       : 4
4 : YOU      : 3
5 : BJORNE   : 1

Refer to the expected output files for longer examples on the expected formatting.
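
The column alignment can be sketched with setw, left, and right from <iomanip>. Two things in this sketch are assumptions and not part of the assignment: the helper digitsIn() is a hypothetical name, and the third parameter out (defaulting to cout) exists only so the output can be captured for checking; the call in main.cpp still works with two arguments.

```cpp
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: number of decimal digits needed to print value.
int digitsIn(size_t value) {
    int digits = 1;
    while (value >= 10) {
        value /= 10;
        ++digits;
    }
    return digits;
}

// Sketch only: prints "#P : WORD : #C" rows with aligned columns.
void printWordsAndCounts(const vector<string>& words,
                         const vector<unsigned int>& counts,
                         ostream& out = cout) {
    size_t wordWidth = 0;
    unsigned int maxCount = 0;
    for (size_t i = 0; i < words.size(); ++i) {
        if (words[i].length() > wordWidth) wordWidth = words[i].length();
        if (counts[i] > maxCount) maxCount = counts[i];
    }
    int positionWidth = digitsIn(words.size());  // width of the last position
    int countWidth = digitsIn(maxCount);         // width of the largest count

    for (size_t i = 0; i < words.size(); ++i) {
        out << right << setw(positionWidth) << (i + 1) << " : "
            << left << setw(static_cast<int>(wordWidth)) << words[i] << " : "
            << right << setw(countWidth) << counts[i] << '\n';
    }
}
```

Note that setw applies only to the next value written, while left and right stay in effect until changed, which is why each column sets its alignment explicitly.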

countLetters()

Input:

  1. A vector of strings
  2. An array of 26 unsigned integers

Output: None

Task: Count the number of occurrences of each letter present in all words. Each position of the array corresponds to each letter as ordered by the English alphabet. Upon completion, the array will hold the counts of each letter and the vector of strings will remain unchanged.
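
The mapping from letter to array position can be sketched with character arithmetic: 'A' maps to index 0, 'B' to index 1, and so on up to 'Z' at index 25. This sketch assumes the caller has already zeroed the array (as main.cpp does) and, as a precaution that the assignment does not require, converts each character to upper case before indexing.

```cpp
#include <cctype>
#include <string>
#include <vector>
using namespace std;

// Sketch only: adds one to letterCounts[letter - 'A'] for every letter found.
// The array is expected to be zero-initialized by the caller.
void countLetters(const vector<string>& words, unsigned int letterCounts[26]) {
    for (const string& word : words) {
        for (char c : word) {
            char upper = static_cast<char>(toupper(static_cast<unsigned char>(c)));
            // skip anything that is not a letter A-Z (digits, hyphens, ...)
            if (upper >= 'A' && upper <= 'Z') {
                ++letterCounts[upper - 'A'];
            }
        }
    }
}
```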

printLetterCounts()

Input: An array of 26 unsigned integers

Output: None

Task: For each letter, print out the letter and its corresponding count. Format the output as follows:

A : #C
B : #C
...
Y : #C
Z : #C

Notice how there are two columns separated by :. We want the : aligned in every row and the values aligned in each column. The columns correspond to the following values:

  1. A - The letter
  2. #C - The corresponding count of the letter. Right align all values. Allocate enough space for the length of the largest number. (Assume there will be at most 10^9 unique words.)

An example with actual values is shown below:

A :  8
B :  5
C :  0
D :  4
E :  1
F :  0
G :  0
H :  8
I :  4
J :  1
K :  0
L :  0
M :  0
N :  1
O :  8
P :  8
Q :  0
R :  5
S :  0
T :  8
U :  3
V :  0
W :  0
X :  0
Y : 11
Z :  0

Refer to the expected output files for longer examples on the expected formatting.
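
A sketch of this two-column layout follows. As assumptions made only for illustration, digitsIn() is a hypothetical helper name and the second parameter out (defaulting to cout) is added so the output can be captured; the one-argument call in main.cpp still compiles against it.

```cpp
#include <iomanip>
#include <iostream>
using namespace std;

// Hypothetical helper: number of decimal digits needed to print value.
int digitsIn(unsigned int value) {
    int digits = 1;
    while (value >= 10) {
        value /= 10;
        ++digits;
    }
    return digits;
}

// Sketch only: prints "A : #C" through "Z : #C" with right-aligned counts.
void printLetterCounts(const unsigned int letterCounts[26], ostream& out = cout) {
    unsigned int maxCount = 0;
    for (int i = 0; i < 26; ++i) {
        if (letterCounts[i] > maxCount) maxCount = letterCounts[i];
    }
    int countWidth = digitsIn(maxCount);  // room for the largest count

    for (int i = 0; i < 26; ++i) {
        out << static_cast<char>('A' + i) << " : "
            << right << setw(countWidth) << letterCounts[i] << '\n';
    }
}
```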

printMaxMinLetter()

Input: An array of 26 unsigned integers

Output: None

Task: Print out the two letters that occur least often and most often. If more than one letter is tied for least frequent or for most frequent, print the one that comes first alphabetically. Upon completion, the input array will remain unchanged. Print out the following pieces of information:

  1. The letter
  2. The frequency of appearance as a percentage to 3 decimal places

Format the output as follows:

Least Frequent Letter: A (#P%)
Most Frequent Letter:  Z (#P%)

Notice how there are two columns of values. The columns correspond to the following values:

  1. A - The letter.
  2. #P - The frequency of the letter. Right align all values. Print to three decimal places.

An example with actual values is shown below:

Least Frequent Letter: C (  0.000%)
Most Frequent Letter:  Y ( 14.667%)

Refer to the expected output files for longer examples on the expected formatting.
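
The search for the minimum and maximum, and the percentage formatting, can be sketched as below. Using strict comparisons (< and >) means an earlier letter is never replaced by a later letter with the same count, which gives the "first alphabetically" tie rule. Assumptions in this sketch: the percentage is the letter's count divided by the total of all 26 counts, the field width of 7 is inferred from the sample values above, and the out parameter (defaulting to cout) is added only so the output can be captured.

```cpp
#include <iomanip>
#include <iostream>
using namespace std;

// Sketch only: reports the least and most frequent letters with percentages.
void printMaxMinLetter(const unsigned int letterCounts[26], ostream& out = cout) {
    int minIndex = 0;
    int maxIndex = 0;
    unsigned int total = 0;
    for (int i = 0; i < 26; ++i) {
        total += letterCounts[i];
        // strict comparisons keep the alphabetically first letter on ties
        if (letterCounts[i] < letterCounts[minIndex]) minIndex = i;
        if (letterCounts[i] > letterCounts[maxIndex]) maxIndex = i;
    }
    // guard against dividing by zero when no letters were counted
    double minPercent = (total == 0) ? 0.0 : 100.0 * letterCounts[minIndex] / total;
    double maxPercent = (total == 0) ? 0.0 : 100.0 * letterCounts[maxIndex] / total;

    out << fixed << setprecision(3) << right;
    out << "Least Frequent Letter: " << static_cast<char>('A' + minIndex)
        << " (" << setw(7) << minPercent << "%)\n";
    out << "Most Frequent Letter:  " << static_cast<char>('A' + maxIndex)
        << " (" << setw(7) << maxPercent << "%)\n";
}
```

With the sample counts shown earlier, the 26 counts add up to 75 letters and Y occurs 11 times, so its frequency is 11 / 75, or 14.667%.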


Functional Requirements



Hints




Grading Rubric


Your submission will be graded according to the following rubric.

Points  Requirement Description
6       All labs completed and submitted: L6A, L6B, L6C
22      Each function's input/output is correct as specified and the function performs the correct task, meeting the functional requirements.
2       (1) Comments used; (2) Coding style followed; (3) Appropriate variable names, constants, and data types used; (4) Instructions followed
30      Total Points



Submission


Always, always, ALWAYS update the header comments at the top of your main.cpp file. And if you ever get stuck, remember that there is LOTS of help available.

It is critical that you follow these steps when submitting homework. You can view these steps by watching the Windows / Mac video.

If you do not follow these instructions, your assignment will receive a major deduction. Why all the fuss? Because we have several hundred of these assignments to grade, and we use computer tools to automate as much of the process as possible. If you deviate from these instructions, our grading tools will not work.


Submission Instructions



Here are step-by-step instructions for submitting your homework properly:

  1. Make sure you have the appropriate comment header block at the top of every source code file for this set. The header block should include the following information at a minimum.
    /* CSCI 261: Assignment 6: A6 - Green Eggs and Ham
     *
     * Author: XXXX (INSERT_NAME)
     * Skip Days Used: #
     * Skip Days Remaining: #
     * Resources used (Office Hours, Tutoring, Other Students, etc & in what capacity):
     *     // list here any outside assistance you used/received while following the
     *     // CS@Mines Collaboration Policy and the Mines Academic Code of Honor
     *
     * XXXXXXXX (MORE_COMPLETE_DESCRIPTION_HERE)
     */
    Be sure to fill in the appropriate information, including:
    • Assignment number
    • Assignment title
    • Your name
    • How many skip days you are applying to this assignment (if you are applying none, still enter zero)
    • The number of skip days you have left for the remainder of the semester (keep track of how many you have used across all assignments)
    • If you received any type of assistance (office hours - whose, tutoring - when), then list where/what/who gave you the assistance and describe the assistance received
    • A description of the assignment task and what the code in this file accomplishes.
  2. File and folder names are extremely important in this process. Please double-check carefully, to ensure things are named correctly.
    1. The top-level folder of your project must be named Set6
    2. Inside Set6, create 4 sub-folders that are required for this Set. The name of each sub-folder is defined in that Set (e.g. L6A, L6B, L6C, and A6).
    3. Copy your files into the subdirectories of Set6 (steps 3-4), zip this Set6 folder (steps 5-6), and then submit the zipped file (steps 7-12) to Canvas.
    4. For example, when you zip/submit Set6, there will be 4 sub-folders called L6A, L6B, L6C, and A6 inside the Set6 folder, and each of these sub-folders will have the associated files.

  3. Using Windows Explorer (not to be confused with Internet Explorer), find the files named functions.h and functions.cpp.

    STOP: Are you really sure you are viewing the correct assignment's folder?

  4. Now, for A6, right-click on functions.h and functions.cpp and copy the files. Then, return to the Set6/A6 folder and right-click to paste the files. In other words, put a copy of your homework's functions.h and functions.cpp source code into the Set6/A6 folder.

    Follow the same steps for L6A, to put a copy of your lab's main.cpp into the Set6/L6A folder. Repeat this process for Set6/L6B, Set6/L6C.

    STOP: Are you sure your Set6 folder now has all your code to submit?

  5. Now, right-click on the "Set6" folder.
    1. In the pop-up menu that opens, move the mouse to "Send to..." and expand the sub-menu.
    2. In the sub-menu that opens, select "Compressed (zipped) folder".

    STOP: Are you really sure you are zipping a Set6 folder with sub-folders that each contain a main.cpp file in it?

  6. After the previous step, you should now see a "Set6.zip" file.

  7. Now visit the Canvas page for this course and click the "Assignments" button in the sidebar.

  8. Find Set6, click on it, find the "Submit Assignment" area, and then click the "Choose File" button.

  9. Find the "Set6.zip" file created earlier and click the "Open" button.

    STOP: Are you really sure you are selecting the right homework assignment? Are you double-sure?

  10. WAIT! There's one more super-important step. Click on the blue "Submit Assignment" button to submit your homework.

  11. No, really, make sure you click the "Submit Assignment" button to actually submit your homework. Clicking the "Choose File" button in the previous step kind of makes it feel like you're done, but you must click the Submit button as well! And you must allow the file time to upload before you turn off your computer!

  12. Canvas should say "Submitted!". Click "Submission Details" and you can download the zip file you just submitted. In other words, verify you submitted what you think you submitted!

In summary, you must zip the "Set6" folder and only the "Set6" folder, this zip folder must have several sub-folders, you must name all these folders correctly, you must submit the correct zip file for this homework, and you must click the "Submit Assignment" button. Not doing these steps is like bringing your homework to class but forgetting to hand it in. No concessions will be made for incorrectly submitted work. If you incorrectly submit your homework, we will not be able to give you full credit. And that makes us unhappy.

