CSCI 432
Operating Systems
Project 0 - Inverted Index
This project will give you experience writing a simple C++ program
using the STL. It also gives you some practice with the autograder, and allows
me to make sure everything is configured correctly.
For this assignment, you will write a program in C++ that generates an
"inverted index" of all the words in a list of text files. (See
Wikipedia for more
details regarding inverted indexes.) The
goal of this assignment is to ensure that you are sufficiently up to
speed in C++ to handle the rest of the course.
Future assignments require you to use C++11. You should compile with
g++ -std=c++11. I also strongly encourage you to use a simple Makefile for compiling your code.
Inverter Input
Your inverter will take exactly one argument: a file that contains a
list of filenames. Each filename will appear on a separate line.
Each of the files described in the first file will contain text from
which you will build your index. For convenience, you can download a
sample Makefile and the files described below here.
For example:
inputs.txt
-----
foo1.txt
foo2.txt

foo1.txt
-----
this is a test. cool.

foo2.txt
-----
this is also a test.
boring.
Inverter Output
Your inverter should print all of the words from all of the inputs, in
"alphabetical" order, followed by the document numbers in which they
appear, in order. For example (note: your program must produce exactly
this output):
a: 0 1
also: 1
boring: 1
cool: 0
is: 0 1
test: 0 1
this: 0 1
"Alphabetical" is defined as ASCII order, so "The" and
"the" are separate words, and "The" comes first. Only words
should be indexed. A word is a run of alphabetic characters only;
digits, spaces, punctuation, etc. end a word. "Th3e" is two words, "Th" and "e".
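The word rule above can be sketched as a small helper. This is only an illustration, not a required design: the function name extractWords is hypothetical, and it assumes the default "C" locale, in which std::isalpha accepts only the ASCII letters.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns every maximal run of alphabetic
// characters in `text`, in the order the runs appear.
std::vector<std::string> extractWords(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (char ch : text) {
        // Cast first: passing a negative char to isalpha is undefined.
        if (std::isalpha(static_cast<unsigned char>(ch))) {
            current += ch;
        } else if (!current.empty()) {
            // Any non-alphabetic character ends the current word.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        words.push_back(current);  // text ended in the middle of a word
    }
    return words;
}
```

With this sketch, "Th3e" yields the two words "Th" and "e", and an empty string yields no words at all.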
Files are incrementally numbered, starting with 0. Only valid,
openable files should be included in the count. (is_open comes in handy here.)
You may assume that you will not be given any duplicate files in the input file (i.e.,
foo1.txt will only appear once).
Your program should absolutely not produce any other output.
Extraneous output, or output formatted incorrectly (extra spaces etc.)
will make the autograder mark your solution as incorrect. Please leave
yourself extra time to work these problems out.
Implementation Hints
Implement the data structure using the C++ Standard Template Library (STL)
as a map of sets, as in:
map<string, set<int> > invertedIndex;
Use C++ strings and file streams. Sample Code:
#include <string>
#include <fstream>
Make sure that your project uses an ifstream, not an fstream. Both classes are
declared in the <fstream> header.
Remember, your program needs to be robust to errors. Files may be
empty, etc. Please handle these cases gracefully and with no extra
output.
The noskipws stream manipulator may be useful in parsing the input:
input >> noskipws >> c;
Handing Project In
Your project will be handed in using the autograding system. Please see the autograder web page for details on how to submit your solution.
Project Writeup
No writeup required for this project.