Due: Friday, April 14th @ Recitation

Synopsis

In this project, you'll re-implement some standard unix utilities. In particular, you'll make clones (perhaps with some missing or different functionality) of wc, sort, uniq, shuf, and tr.

Important Notes on Grading

Unlike prior assignments, you'll do this one collaboratively. Teams must be 3 - 4 people, and your teammates must be in the same recitation.

Details

Again, we'll try to adhere to the unix philosophy: your program will read standard input, write standard output, and won't print much of anything that isn't necessary. Each of the tools we are making has a man page, accessible via man [command name], so I'll let you read those to get a general idea of what each one is supposed to do. In what follows, I will mainly outline the ways that your version will differ from the standard one (that is, which behaviors you don't need to emulate).

One simplification we'll use for now: you don't need to work with files. You can always just read stdin and write to stdout. Here are some details for each one of the programs.

uniq

For this, the only options you need to make work are -c (show the count before each line), -d (only show duplicated lines), and -u (only print unique lines).

wc (word count)

You should implement the standard -c, -l, -w, -L options, as well as a new -u (--uwords) option, which is explained below. Without arguments, it should print the line, word, and byte counts, just as the system's wc does. A few differences from the system's version:

  1. You don't need to process arguments (like filenames); you can assume that all the input comes from standard input.
  2. You don't need to format the output exactly the same way as wc: you can just separate each piece of the output by a tab \t character, or a few spaces (but don't print anything that isn't whitespace between the output fields).
  3. You should add a -u (--uwords) option which counts the unique words.

Here is some sample output from the program:

$ echo "this is a test" | ./wc
        1       4       15

The format of the output is

#lines  #words  #bytes  [longest line len]  #uniquewords

with each separated by whitespace (you can use tabs if you want). If no options are given, just print the first 3 items above, just like normal wc would do. Otherwise, print only the values corresponding to the flags your program was given, and output them in the same order as above.

shuf

This just permutes the lines of the input. It is important that each permutation is equally likely to be selected. You should implement the options -e, -i, and -n.

sort

You just need to sort the lines of the input. The only options you need to worry about are -r (sort descending), -u (don't print duplicates), and -f (ignore case).

tr

This translates characters from one set to another. The only options you need to worry about are -c and -d. To make things easier, you don't need to worry about character classes at all, although you should make your program work with at least the escape sequences \\, \n, and \t.

Compiling and Testing

Makefiles

There's a makefile, so all you have to do to build the programs is execute make. If you want to try to reverse engineer the makefile, maybe look at the tutorials here and here.

Automated tests

The script ./rand-test.sh takes a list of programs you want to test, invents some random files to test them on, and then checks (using every combination of flags) that they do the same thing as the system versions. For example, to test them all:

$ ./rand-test.sh wc tr sort shuf uniq

By default, it will delete the test files after a successful test. To keep them, do this:

$ save=1 ./rand-test.sh wc tr sort shuf uniq

Passing these tests does not mean your programs actually work, but it is at least a good sign that they aren't totally broken!

Testing by hand

Good news - since you are reinventing wheels, working versions of all these programs are already on your computer! So it is quite easy to check your output on any input. E.g., to test my version of sort against the system's, using my wc.cpp as input:

$ sha1sum <(./sort <wc.cpp) <(sort wc.cpp)
b0e0f70a9399dd7097fa8d3f0b2397919aac33b9  /dev/fd/63
b0e0f70a9399dd7097fa8d3f0b2397919aac33b9  /dev/fd/62

And if there are some differences, you can examine them using vimdiff:

$ vimdiff <(./sort < wc.cpp) <(sort wc.cpp)

Two shell features to notice (both of which we've mentioned before, I think):

  1. Input redirection (the < filename stuff). This construct just takes the file and presents it to the program as stdin. Even though the only file your programs read is stdin, we can use this feature to send them any file.
  2. Process substitution (the kind of funky <(...) syntax). This just takes the output of one program and presents it to another program as a file. (Contrast this with a usual pipeline which sends one program's output to stdin of another.) Read more about it here. It can be pretty useful.

You can also construct complicated pipelines with some pieces written by you, and some from the system. For example, if I have written my own uniq, but I haven't yet done sort, I can still test it out (recall that uniq requires its input to be sorted) like this:

$ sha1sum <(cat *.cpp | sort | ./uniq -c -d) <(cat *.cpp | sort | uniq -c -d)
6d5d25046b627d468bf20a1f84f018e957e3ba58  /dev/fd/63
6d5d25046b627d468bf20a1f84f018e957e3ba58  /dev/fd/62

Hints

I would advise you to tackle these in roughly the order they appear below.

uniq

NOTE: this utility assumes that the input is already sorted, so any duplicated entries will appear next to each other. This lets you write the program without using any fancy data structures (vectors, sets, etc.). Moreover, your program should consume a constant amount of memory, irrespective of the size of the input. Do not store the entirety of the input! Just print as you go. NOTE: this also shows a difference between sort -u and uniq: the memory footprint of sort -u will be larger than uniq's, so if your input is already sorted, uniq is the better choice.
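To make the constant-memory idea concrete, here is one hedged sketch of the streaming logic. The name uniq_stream and the flag parameters are invented for illustration; the real program would read stdin and write stdout directly, but taking streams as parameters makes it easy to test.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Keep only the previous line and a count; emit a record each time the
// current group of identical lines ends. Memory use is constant.
void uniq_stream(std::istream& in, std::ostream& out,
                 bool show_count, bool dups_only, bool uniq_only) {
    std::string prev, line;
    long count = 0;
    // decide whether the group that just ended should be printed
    auto emit = [&]() {
        if (count == 0) return;             // no group yet
        if (dups_only && count < 2) return; // -d: only duplicated lines
        if (uniq_only && count > 1) return; // -u: only unique lines
        if (show_count) out << count << ' ';  // -c: count before the line
        out << prev << '\n';
    };
    while (std::getline(in, line)) {
        if (count > 0 && line == prev) {
            ++count;        // same group: just bump the counter
        } else {
            emit();         // previous group ended
            prev = line;
            count = 1;
        }
    }
    emit();                 // don't forget the final group
}
```

Note how the duplicates being adjacent (because the input is sorted) is what makes this one-line-of-memory approach work at all.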

wc

General advice: a simple way to do this is to read one character at a time and process it via a state diagram. Perhaps use something like the following:

Fig. 1: State Diagram

Start in the node on the left hand side, and begin reading characters, one by one. As you read a character, you will move along the arrows as prescribed by the diagram. The diagram may look a little too simple to be useful, but note that what you want to do upon reading a character depends on what type of character you were reading previously - that is, it depends on which node you are in! So once you've modeled the state diagram, all that remains is to add the right actions for each case. E.g., if you were just reading non-whitespace and then read whitespace, you know you hit a word boundary, so increment the word count and add the current word to your set of unique words. And if the whitespace character you just read was a newline (\n), then you should also increment the line count.

I realize the diagram doesn't look much like C++, but it is actually easy to convert. Just use an integer (or even boolean in this case) for which node you are in, and update it according to the diagram as you read characters. Using an enum is also common for situations like this. It lets you use memorable names instead of numbers, and without the hassle of setting up a separate variable for each state:

// with an enum:
enum states {ws,nws};

// the equivalent without an enum:
const int ws = 0;
const int nws = 1;

// using the enum, you can store the values in an int, or
// a variable of type "states":
int state = ws;

Unique words: For counting unique words, you can use a set. As you are traversing the arrows in the state diagram, keep adding new non-whitespace characters to a temporary string variable. Once you transition to the whitespace state, take whatever was in that variable and add it to a set. The set will take care of removing the duplicates for you.
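Putting the state diagram and the set together, the core loop might look roughly like the sketch below. The name count_unique_words is made up, and taking the text as a string (rather than reading characters from stdin) is just a convenience for testing; the transitions are exactly the ones described above.

```cpp
#include <cctype>
#include <set>
#include <string>

// Two-state loop from Fig. 1: count words and collect unique words.
long count_unique_words(const std::string& text, std::set<std::string>& uwords) {
    enum states { ws, nws };    // whitespace vs. non-whitespace, as in the hint
    int state = ws;
    long words = 0;
    std::string cur;            // characters of the word currently being read
    for (char c : text) {
        if (isspace((unsigned char)c)) {
            if (state == nws) { // non-ws -> ws transition: a word just ended
                ++words;
                uwords.insert(cur);  // the set drops duplicates for us
                cur.clear();
            }
            state = ws;
        } else {
            cur += c;           // still inside a word
            state = nws;
        }
    }
    if (state == nws) {         // input may not end with whitespace (see below)
        ++words;
        uwords.insert(cur);
    }
    return words;
}
```

The check after the loop handles the "last word" pitfall mentioned below: if the input doesn't end in whitespace, the final word never triggers a transition.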

Reading the input: the following looks plausibly correct:

char c;
while (cin >> c) {
    // process c
}

But you will probably not enjoy the results. This will do some strange things with whitespace, and will almost certainly throw off your results. The following should lead to fewer surprises:

char c;
while (fread(&c,1,1,stdin)) {
    // process c
}

Debugging unique words: it might be nice to just see what is in your set of unique words while testing. Remember that you can go through a set using iterators like we saw in lecture:

set<string> uwords;
// store unique words in uwords...
// dump contents of uwords:
for (set<string>::iterator i = uwords.begin(); i!=uwords.end(); i++) {
    cout << *i << endl;
}

The last word: you may have forgotten the last word. Many files end with a newline, so you might not run into this often. But see what happens when you do this:

echo -n "hello there" | ./wc -u

If this shows one word instead of two, see if you can fix it.

Counting tabs: if you compare your output of ./wc -L with the system's version on an input that contains tab characters, you might find some discrepancies. This is because -L attempts to calculate the longest line as it would be displayed, and assumes that a tab will be displayed using as many spaces as needed to reach the next multiple of 8 characters. Once you know this is the problem, it should be an easy fix: when you hit a tab, instead of simply incrementing, try something like len += 8-len%8 where len is the length of your current line.
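As a tiny sanity check of that arithmetic (line_len_after is a made-up helper just for illustration):

```cpp
// The display length after reading character c: a tab jumps to the next
// multiple of 8, anything else advances the length by one.
long line_len_after(long len, char c) {
    if (c == '\t')
        return len + (8 - len % 8);  // advance to the next tab stop
    return len + 1;
}
```

For example, a tab at column 3 lands you at column 8, not column 4.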

shuf

General advice: It might not be obvious how to select a random permutation such that each permutation is equally likely. Here are a few hints. First, note that you will need to know the total number of lines before printing anything out, so I might recommend just storing all of stdin in a vector and then applying a random permutation. You can actually apply the permutation to the vector, or just permute the indexes and use that to determine the order for printing.

As for the random permutation, try this: pick a random thing to go first, and explicitly place it first. Then pick a random thing to go second, and explicitly place it second (but make sure that the first element stays put!). Continue like this until the last element of the vector. It is actually quite similar to our naive sorting algorithm from lecture: when sorting, we would place into V[i] the smallest element in V[i,i+1,...,size-1]. Here, we can do the same thing, but instead of the smallest, just use a random one. Also, you will need random numbers! Here's how to get some:

#include <cstdlib>   // rand, srand
#include <ctime>     // time
#include <iostream>
using namespace std;

int main() {
    // stuff...
    srand(time(0)); // use current time to determine pseudorandom sequence
    // print 10 random numbers:
    for (int i=0; i<10; i++)
        cout << rand() << "\n";
}

Non-option arguments: this program, unlike most of the others, can accept non-option command line arguments (e.g. when -e is supplied). It isn't obvious how to get these things. Here's one way to go:

// if you are storing the input in a vector V, try this:
while (optind < argc)
    V.push_back(argv[optind++]);
// note: the variable `optind` is supplied in the `<getopt.h>` header.
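If it helps, here is one hedged way the whole option-handling step could be structured with getopt. The struct and function names are invented for illustration; only getopt, optarg, and optind are the real library pieces, and the exact option string is up to you.

```cpp
#include <getopt.h>
#include <cstdlib>
#include <string>
#include <vector>

struct shuf_opts {
    bool echo_args = false;             // -e: treat arguments as input lines
    long head_count = -1;               // -n: how many lines to print (-1 = all)
    std::vector<std::string> extra;     // the non-option arguments
};

shuf_opts parse_shuf_opts(int argc, char* argv[]) {
    shuf_opts o;
    int c;
    // "en:" means: -e is a plain flag, -n takes a value (found in optarg)
    while ((c = getopt(argc, argv, "en:")) != -1) {
        switch (c) {
            case 'e': o.echo_args = true; break;
            case 'n': o.head_count = atol(optarg); break;
        }
    }
    // after the loop, optind points at the first non-option argument:
    while (optind < argc)
        o.extra.push_back(argv[optind++]);
    return o;
}
```

So `./shuf -e -n 3 x y z` would end up with echo_args set, head_count 3, and {"x","y","z"} in extra.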

A few more notes:

  1. Note that rand() gives pseudorandom numbers based on a seed, rather than actual random numbers. So each permutation will not really be equally likely, but that's alright.
  2. Calling rand() gives you (pseudo)random integers in the range [0,...,RAND_MAX]. The upper bound there is probably a lot larger than the indexes you have into your vector. You can get random numbers in a different range by reducing modulo the desired upper bound.
  3. If you have time, try to solve exercises 1 and 2 at the end of the page.

sort

There are a few strategies you could use for this. One way is to read the entirety of stdin into a vector of strings, and then sort the vector as we saw in class (or some other way if you want), and then use the flags on the command line to determine how you show the output. Other alternatives might be to use sets, multisets, or maps (storing each item with its multiplicity).
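As a sketch of the vector-of-strings strategy: sort with a comparator that folds case for -f, then deduplicate for -u (easy, since duplicates are adjacent after sorting), then reverse for -r. The name sort_lines and the flag parameters are invented for illustration; the flag plumbing is up to you.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

std::vector<std::string> sort_lines(std::vector<std::string> v,
                                    bool rev, bool uniq, bool fold) {
    // lowercase a copy of a string, for case-insensitive comparison (-f)
    auto lower = [](const std::string& s) {
        std::string t = s;
        for (char& c : t) c = tolower((unsigned char)c);
        return t;
    };
    std::sort(v.begin(), v.end(),
              [&](const std::string& a, const std::string& b) {
                  return fold ? lower(a) < lower(b) : a < b;
              });
    if (uniq)  // -u: duplicates are adjacent now, so std::unique removes them
        v.erase(std::unique(v.begin(), v.end(),
                            [&](const std::string& a, const std::string& b) {
                                return fold ? lower(a) == lower(b) : a == b;
                            }),
                v.end());
    if (rev)   // -r: descending order
        std::reverse(v.begin(), v.end());
    return v;
}
```

Note that without -f, plain `a < b` on std::string compares byte values, which is why capital letters sort before lowercase ones.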

tr

General advice: I would recommend using either a map or a vector (of length 256) to store the translation of each input character. Once you have this map or vector set up, the rest should be a piece of cake! One word of caution if using a vector, though: if you index elements with something of type char, note that characters above 127 are typically interpreted as negative numbers (char is usually signed)! Here's one way to fix it:

char c;
// ... c gets a value somehow ...
cout << V[(unsigned char)c]; // good :D
cout << V[c]; // bad x_x  don't do this.
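Building the table itself might look something like the sketch below. The name build_table is made up; this assumes set2 is non-empty, and follows GNU tr's habit of reusing the last character of set2 when set1 is longer.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Every character maps to itself unless it appears in set1, in which
// case it maps to the corresponding character of set2.
std::vector<char> build_table(const std::string& set1, const std::string& set2) {
    std::vector<char> table(256);
    for (int i = 0; i < 256; i++)
        table[i] = (char)i;                       // identity by default
    for (size_t i = 0; i < set1.size(); i++) {
        size_t j = std::min(i, set2.size() - 1);  // reuse set2's last char
        table[(unsigned char)set1[i]] = set2[j];
    }
    return table;
}
```

With the table in hand, the main loop is just: read a character c, print table[(unsigned char)c].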

Quick tests: if you need a quick way to see if your tr is working, maybe try something like this, which counts how many braces your cpp files have:

$ cat *.cpp | ./tr -c -d "{}" | wc -c
186
$ cat *.cpp | tr -c -d "{}" | wc -c
186
# or, see if the output is exactly the same:
$ sha1sum <(cat *.cpp | ./tr -c -d "{}") <(cat *.cpp | tr -c -d "{}")
11c789382255fb752602f8a50bbd046401c75496  /dev/fd/63
11c789382255fb752602f8a50bbd046401c75496  /dev/fd/62

What tr actually does: The tr command has some subtleties. Here is some instructive output of the system's tr command which might help you understand them:

$ echo "cba" | tr "abc" "def"
fed
$ echo "hello there" | tr "el" "ar"
harro thara

One thing to watch out for is the -c option. Think of it this way: characters in the first set will be preserved, and all others will be translated. The tricky thing is that the translation table will start at character 0, and have a few "holes" for the characters that are to be preserved. An example:

$ echo -e "\x00hello there" | tr -c 'el\n' 'xyz'
xzellzzzzeze

Notice how the 0 character \x00 was translated to an x, which happens to be the first character of the second set. Now see if you can make sense of this one:

$ echo -e "\x00hello there" | tr -c 'el\n' 'A-Z0-9'
A9ell9599e9e

The translation table is A,B,C,...,Z,0,1,2,...,9. The space character has ascii value 32. You might think that it should have been mapped to the item at index 32 in the table, but the careful observer will see that it actually mapped to 5, which is index 31. So what's going on? The list of characters to be translated has some holes in it. In particular it is missing the newline, which would have been at index 10. This effectively moves space back one index to 31. Here's a more complicated example:

$ echo -e "hello there" | tr -c 'el\n' '\0-\377'
felllqfeoe
$ echo -e "hello there" | tr -c 'el\n' '\0-\377' | xxd
00000000: 6665 6c6c 6c1f 7166 656f 650a            felll.qfeoe.

The translation table is just all characters in order (ascii values 0 to 255). You can see that the space was translated to character 31 (see note 3 below), the 'h' was translated to 'f', which is two indexes behind, as there are now two holes in the input list, and 'r' mapped to 'o', which is three indexes behind, since the index of 'r' is after all three characters that we're leaving out.
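To make the hole-skipping concrete, here is a sketch of how the -c table could be built. The name build_complement_table is invented; it assumes set2 is non-empty, and (like GNU tr) keeps reusing the last character of set2 once it runs out.

```cpp
#include <string>
#include <vector>

// Walk the 256 byte values in order. Characters in set1 are "holes"
// (preserved as-is); every other byte gets the next character of set2.
std::vector<char> build_complement_table(const std::string& set1,
                                         const std::string& set2) {
    std::vector<char> table(256);
    size_t j = 0;                           // current position in set2
    for (int i = 0; i < 256; i++) {
        if (set1.find((char)i) != std::string::npos) {
            table[i] = (char)i;             // a hole: preserved, j does NOT advance
        } else {
            table[i] = set2[j];
            if (j + 1 < set2.size()) j++;   // clamp at set2's last character
        }
    }
    return table;
}
```

Because j doesn't advance on holes, a character after a hole lands one index earlier in set2 than its raw byte value would suggest, which is exactly the space-maps-to-31 effect in the examples above.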

A Few Exercises

Here are some things to try with the tools you just wrote.

Submission Procedure

Your assignment will be graded by the TAs soon after the deadline, possibly during recitation (so you should be prepared to explain what you wrote!). For full credit, your project should

  1. Pass the automated tests;
  2. Be hosted on bitbucket;
  3. Have non-trivial contributions from each team member (this should be reflected in the commit logs).

  1. Exercise: even if rand() produced uniformly random numbers in [0,...,RAND_MAX] (rather than pseudorandom), I claim that for many values of k, rand() % k will not quite be uniform on the range [0,...,k-1]. See if you can figure out why, and which values of k would give the uniform distribution. Lastly, try to measure how far from uniform this distribution is in terms of RAND_MAX and k.

  2. Exercise: under the (faulty) assumption that rand() % k produces the uniform distribution on [0,...,k-1], try to prove that each permutation would be equally likely to be produced by your procedure.

  3. Note that character 31 is a non-printable control character, which is why it doesn't show up in the first output; after running it through xxd, we can see the character there in hex (it's the byte 1f).

  4. For reference, I got away with 235. The project skeleton had 120.