In this project, you'll re-implement some standard unix utilities. In particular, you'll make clones (perhaps with some missing or different functionality) of wc, sort, uniq, shuf, and tr.
Unlike prior assignments, you'll do this one collaboratively. Teams must be 3 to 4 people, and your teammates must be in the same recitation.
Again, we'll try to adhere to the unix philosophy: your program will read standard input, write standard output, and won't print much of anything that isn't necessary. Each of the tools we are making has a man page, accessible via man [command name], so I'll let you read those to get a general idea of what each one is supposed to do. In what follows, I will mainly outline the ways that your version will be different from the standard one (that is, which behaviors you don't need to emulate).
One simplification we'll use for now: you don't need to work with files. You can always just read stdin and write to stdout. Here are some details for each one of the programs.
uniq: the only options you need to make work are -c (show the count before each line), -d (only show duplicated lines), and -u (only print unique lines).
wc: you should implement the standard -c, -l, -w, -L options, as well as a new -u option which is explained below. Without arguments, it should print the line, word, and byte counts, just as the system's wc does. A few differences from the system's version:

- You can just separate each piece of the output by a tab (\t) character, or a few spaces (but don't print anything that isn't whitespace between the output fields).
- The new -u (--uwords) option counts the unique words.

Here is some sample output from the program:
$ echo "this is a test" | ./wc
1	4	15

The format of the output is
#lines #words #bytes [longest line len] #uniquewords
with each separated by whitespace (you can use tabs if you want). If no options are given, just print the first 3 items above, just like normal wc would do. Otherwise, print only the values corresponding to the flags your program was given, and output them in the same order as above.
shuf: this just permutes the lines of the input. It is important that each permutation is equally likely to be selected. You should implement the options -e, -i, -n.
sort: you just need to sort the lines of the input. The only options you need to worry about are -r (sort descending), -u (don't print duplicates), and -f (ignore case).
tr: this translates characters from one set to another. The only options you need to worry about are -c and -d. To make things easier, you don't need to worry about character classes at all, although you should make your program work with at least the escape sequences \\, \n, and \t.
There's a makefile, so all you have to do to build the programs is execute make. If you want to try to reverse engineer the makefile, maybe look at the tutorials here and here.
The script ./rand-test.sh takes a list of programs you want to test, invents some random files to test them on, and then checks (using every combination of flags) that they do the same thing as the system versions. For example, to test them all:
$ ./rand-test.sh wc tr sort shuf uniq

By default, it will delete the test files after a successful test. To keep them, do this:
$ save=1 ./rand-test.sh wc tr sort shuf uniq

Passing these tests does not mean your programs actually work, but it is at least a good sign that they aren't totally broken! A few things to note, in particular:
- The tests can't really verify shuf, which is by nature supposed to have unpredictable output.
- They don't cover the -u option for your wc program.
- The tests for tr are kind of limited.

Good news: since you are reinventing wheels, working versions of all these programs are already on your computer! So it is quite easy to check your output on any input. E.g., to test my version of sort against the system's, using my wc.cpp as input:
$ sha1sum <(./sort <wc.cpp) <(sort wc.cpp)
b0e0f70a9399dd7097fa8d3f0b2397919aac33b9 /dev/fd/63
b0e0f70a9399dd7097fa8d3f0b2397919aac33b9 /dev/fd/62

And if there are some differences, you can examine them using vimdiff:
$ vimdiff <(./sort < wc.cpp) <(sort wc.cpp)

Two shell features to notice (both of which we've mentioned before, I think):
- Input redirection (the < filename stuff). This construct just takes the file and presents it to the program as stdin. Even though the only file your programs read is stdin, we can use this feature to send them any file.
- Process substitution (the <(...) syntax). This just takes the output of one program and presents it to another program as a file. (Contrast this with a usual pipeline, which sends one program's output to stdin of another.) Read more about it here. It can be pretty useful.

You can also construct complicated pipelines with some pieces written by you, and some from the system. For example, if I have written my own uniq, but I haven't yet done sort, I can still test it out (recall that uniq requires its input to be sorted) like this:
$ sha1sum <(cat *.cpp | sort | ./uniq -c -d) <(cat *.cpp | sort | uniq -c -d)
6d5d25046b627d468bf20a1f84f018e957e3ba58 /dev/fd/63
6d5d25046b627d468bf20a1f84f018e957e3ba58 /dev/fd/62

I would advise you to tackle these in the approximate order in which they appear in what follows.
uniq. NOTE: this utility assumes that the input is already sorted. As a consequence, any duplicated entries will appear next to each other. This should allow you to write this program without using any fancy data structures, like vectors, sets, etc. Moreover, your program should consume a constant amount of memory, irrespective of the size of the input. Do not store the entirety of the input! Just print as you go. NOTE: this also shows a difference between sort -u and uniq: the memory footprint of sort -u will be larger than that of uniq, so if your input is already sorted, uniq is a better choice.
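The print-as-you-go idea can be sketched like this. This is a minimal version under my own naming (run_uniq, and showC/showD/showU standing in for the -c/-d/-u flags); real flag parsing and the exact -c output format are omitted:

```cpp
#include <istream>
#include <sstream>
#include <string>
using namespace std;

// Streaming uniq sketch (input assumed sorted). Only the previous line and
// a running count are kept, so memory use is constant regardless of input
// size. showC/showD/showU are stand-ins for the -c/-d/-u flags.
string run_uniq(istream& in, bool showC, bool showD, bool showU) {
    ostringstream out;
    string prev, line;
    long count = 0;
    auto flush = [&]() {
        if (count == 0) return;            // nothing buffered yet
        if (showD && count < 2) return;    // -d: duplicated lines only
        if (showU && count > 1) return;    // -u: unique lines only
        if (showC) out << count << '\t';
        out << prev << '\n';
    };
    while (getline(in, line)) {
        if (count > 0 && line == prev) { ++count; continue; }
        flush();                           // a new group starts: emit the old one
        prev = line;
        count = 1;
    }
    flush();                               // don't forget the final group
    return out.str();
}
```

In the real program, main would simply write run_uniq(cin, ...) to cout after reading the flags.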
General advice: a simple way to implement wc is to read one character at a time and process it via a state diagram. Perhaps use something like the following:
Fig. 1: State Diagram
Start in the node on the left hand side, and begin reading characters, one by one. As you read a character, you will move along the arrows as prescribed by the diagram. The diagram may look a little too simple to be useful, but note that what you want to do upon reading a character depends on what type of character you were reading previously - that is, it depends on which one of the nodes you are in! So once you've modeled the state diagram, all that remains is to add the right actions for each case. E.g., if you were just reading non-whitespace, and then read whitespace, you know you hit a word boundary, so increment the word count, and add the current word to your set of unique words. And if the whitespace character you just read was a newline (\n), then you should also increment the line count.
I realize the diagram doesn't look much like C++, but it is actually easy to convert. Just use an integer (or even a boolean in this case) to record which node you are in, and update it according to the diagram as you read characters. Using an enum is also common for situations like this: it lets you use memorable names instead of bare numbers, without the hassle of setting up a separate variable for each state.
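For instance, a two-state machine along these lines (the State/IN_SPACE/IN_WORD names are just illustrative, not part of the assignment; counting a word at the space-to-word edge also sidesteps the "last word" pitfall mentioned below):

```cpp
#include <cctype>
#include <string>
using namespace std;

// The two nodes of the diagram: currently reading whitespace, or a word.
enum State { IN_SPACE, IN_WORD };

// Walk the text once, moving between the two states as characters arrive.
void count_text(const string& text, long& lines, long& words) {
    State state = IN_SPACE;
    lines = 0;
    words = 0;
    for (char c : text) {
        if (c == '\n') ++lines;
        if (isspace((unsigned char)c)) {
            state = IN_SPACE;          // word -> space edge (or still in space)
        } else if (state == IN_SPACE) {
            state = IN_WORD;           // space -> word edge: a new word begins
            ++words;
        }
    }
}
```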
Unique words: For counting unique words, you can use a set. As you are traversing the arrows in the state diagram, keep adding new non-whitespace characters to a temporary string variable. Once you transition to the whitespace state, take whatever was in that variable and add it to a set. The set will take care of removing the duplicates for you.
Reading the input: reading characters with cin >> c looks plausibly correct, but you will probably not enjoy the results: >> skips whitespace by default, which will do strange things to your counts. Reading with cin.get(c) instead delivers every character, whitespace included, and should lead to fewer surprises.
Debugging unique words: it might be nice to just see what is in your set of unique words while testing. Remember that you can go through a set using iterators, like we saw in lecture.
The last word: you may have forgotten the last word. Many files end with a newline, so you might not run into this often. But see what happens when you do this:
$ echo -n "hello there" | ./wc -u

If this shows one word instead of two, see if you can fix it.
Counting tabs: if you compare your output of ./wc -L with the system's version on an input that contains tab characters, you might find some discrepancies. This is because -L attempts to calculate the longest line as it would be displayed, and assumes that a tab will be displayed using as many spaces as needed to reach the next multiple of 8 characters. Once you know this is the problem, it should be an easy fix: when you hit a tab, instead of simply incrementing, try something like len += 8-len%8 where len is the length of your current line.
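That tab adjustment can be packaged as a small helper (display_len is my own name; it assumes the 8-column tab stops described above):

```cpp
#include <string>
using namespace std;

// Display width of a line for wc -L: a tab advances to the next multiple
// of 8 columns; every other character counts as one column.
long display_len(const string& line) {
    long len = 0;
    for (char c : line)
        len += (c == '\t') ? 8 - len % 8 : 1;
    return len;
}
```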
General advice: it might not be obvious how to select a random permutation for shuf such that each permutation is equally likely. Here are a few hints. First, note that you will need to know the total number of lines before printing anything out, so I would recommend just storing all of stdin in a vector and then applying a random permutation. You can actually apply the permutation to the vector, or just permute the indexes and use those to determine the order for printing.
As for the random permutation, try this: pick a random thing to go first, and explicitly place it first. Then pick a random thing to go second, and explicitly place it second (but make sure that the first element stays put!). Continue like this until the last element of the vector. It is actually quite similar to our naive sorting algorithm from lecture: when sorting, we would place into V[i] the smallest element in V[i,i+1,...,size-1]. Here, we can do the same thing, but instead of the smallest, just use a random one. Also, you will need random numbers: seed the generator once with srand, then call rand to get them.
Non-option arguments: this program, unlike most of the others, can accept non-option command line arguments (e.g. when -e is supplied). It isn't obvious how to get at these. One way to go: after option parsing with getopt finishes, the global optind holds the index of the first non-option argument in argv.
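For illustration, assuming POSIX getopt (your skeleton may parse options differently; the -e/-n flags here are just examples):

```cpp
#include <unistd.h>   // getopt, optind (POSIX)
#include <string>
#include <vector>
using namespace std;

// Parse options, then collect everything left over. Once getopt returns -1,
// optind is the index of the first non-option argument in argv.
vector<string> non_option_args(int argc, char* argv[]) {
    bool eflag = false;
    int c;
    optind = 1;                                  // reset in case of reuse
    while ((c = getopt(argc, argv, "en:")) != -1) {
        if (c == 'e') eflag = true;              // -n's value would be in optarg
    }
    (void)eflag;                                 // a real shuf would act on this
    vector<string> rest;
    for (int i = optind; i < argc; ++i)
        rest.push_back(argv[i]);
    return rest;
}
```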
A few more notes:
- rand() gives pseudorandom numbers based on a seed, rather than actual random numbers. So each permutation will not really be equally likely, but that's alright.
- rand() gives you (pseudo)random integers in the range [0,...,RAND_MAX]. The upper bound there is probably a lot larger than the indexes you have into your vector. You can get random numbers in a different range by reducing modulo the desired upper bound.

sort: there are a few strategies you could use for this one. One way is to read the entirety of stdin into a vector of strings, sort the vector as we saw in class (or some other way if you want), and then use the flags on the command line to determine how you show the output. Other alternatives might be to use sets, multisets, or maps (storing each item with its multiplicity).
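A minimal sketch of the read-everything-then-sort strategy (sort_lines, and the rev/fold/uniq parameters standing in for -r/-f/-u, are my own names; note that the real sort -u combined with -f deduplicates case-insensitively, which this sketch does not attempt):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>
using namespace std;

// Lowercase a copy of s; stands in for the -f (case folding) comparison.
static string lowered(const string& s) {
    string t = s;
    for (char& c : t) c = (char)tolower((unsigned char)c);
    return t;
}

// Sort all lines, then let the flags shape the output.
vector<string> sort_lines(vector<string> v, bool rev, bool fold, bool uniq) {
    sort(v.begin(), v.end(), [&](const string& a, const string& b) {
        const string x = fold ? lowered(a) : a;
        const string y = fold ? lowered(b) : b;
        return rev ? y < x : x < y;            // -r just flips the comparison
    });
    if (uniq)                                  // -u: drop adjacent duplicates
        v.erase(unique(v.begin(), v.end()), v.end());
    return v;
}
```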
General advice: for tr, I would recommend using either a map or a vector (of length 256) to store the translation of each input character. Once you have this map or vector set up, the rest should be a piece of cake! One word of caution if using a vector, though: if you index elements with something of type char, note that characters above 127 are interpreted as negative numbers! The fix is to cast the index to unsigned char first.
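Concretely, something like this (translate is a made-up helper):

```cpp
#include <string>
#include <vector>
using namespace std;

// Translate each byte through a 256-entry table. The cast matters: char is
// often signed, so bytes >= 128 would otherwise produce a negative index.
string translate(const string& in, const vector<char>& table) {
    string out;
    out.reserve(in.size());
    for (char c : in)
        out += table[(unsigned char)c];   // safe for high bytes
    return out;
}
```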
Quick tests: if you need a quick way to see if your tr is working, maybe try something like this, which counts how many braces your cpp files have:
$ cat *.cpp | ./tr -c -d "{}" | wc -c
186
$ cat *.cpp | tr -c -d "{}" | wc -c
186
# or, see if the output is exactly the same:
$ sha1sum <(cat *.cpp | ./tr -c -d "{}") <(cat *.cpp | tr -c -d "{}")
11c789382255fb752602f8a50bbd046401c75496 /dev/fd/63
11c789382255fb752602f8a50bbd046401c75496 /dev/fd/62

What tr actually does: the tr command has some subtleties. Here is some instructive output of the system's tr command which might help you understand them:
$ echo "cba" | tr "abc" "def"
fed
$ echo "hello there" | tr "el" "ar"
harro thara

One thing to watch out for is the -c option. Think of it this way: characters in the first set will be preserved, and all others will be translated. The tricky thing is that the translation table will start at character 0, and have a few "holes" for the characters that are to be preserved. An example:
$ echo -e "\x00hello there" | tr -c 'el\n' 'xyz'
xzellzzzzeze

Notice how the 0 character \x00 was translated to an x, which happens to be the first character of the second set. Now see if you can make sense of this one:
$ echo -e "\x00hello there" | tr -c 'el\n' 'A-Z0-9'
A9ell9599e9e

The translation table is A,B,C,...,Z,0,1,2,...,9. The space character has ascii value 32. You might think that it should have been mapped to the item at index 32 in the table, but the careful observer will see that it actually mapped to 5, which is at index 31. So what's going on? The list of characters to be translated has some holes in it. In particular, it is missing the newline, which would have been at index 10. This effectively moves space back one index, to 31. Here's a more complicated example:
$ echo -e "hello there" | tr -c 'el\n' '\0-\377'
felllqfeoe
$ echo -e "hello there" | tr -c 'el\n' '\0-\377' | xxd
00000000: 6665 6c6c 6c1f 7166 656f 650a  felll.qfeoe.

The translation table is just all characters in order (ascii values 0 to 255). You can see that the space was translated to 31,[3] that the 'h' was translated to 'f', which is two indexes behind, as there are now two holes in the input list before it, and that 'r' mapped to 'o', which is 3 indexes behind, since 'r' comes after all three characters that we're leaving out.
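Putting the -c behavior together, a table with "holes" can be built like the examples suggest (build_complement_table is my own name; it assumes SET2 is non-empty and that range notation like A-Z has already been expanded into a plain string):

```cpp
#include <string>
#include <vector>
using namespace std;

// Build a tr -c translation table the way the examples behave: walk bytes
// 0..255 in order, leave SET1 characters alone (the "holes"), and hand each
// remaining byte the next character of SET2, repeating SET2's last
// character once it runs out.
vector<char> build_complement_table(const string& set1, const string& set2) {
    vector<bool> keep(256, false);
    for (char c : set1) keep[(unsigned char)c] = true;
    vector<char> table(256);
    size_t next = 0;
    for (int i = 0; i < 256; ++i) {
        if (keep[i]) { table[i] = (char)i; continue; }  // hole: preserved
        table[i] = set2[next];
        if (next + 1 < set2.size()) ++next;             // saturate at last char
    }
    return table;
}
```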
Here are some things to try with the tools you just wrote.
- … (wc, sort, uniq).
- Use tr, sort, and uniq to print a list of unique words. Note that your version of wc only counts the unique words and doesn't have an option to print the list.
- Use tr and wc to count how many semicolons are in all of your cpp files.[4]

Your assignment will be graded by the TAs soon after the deadline, possibly during recitation (so you should be prepared to explain what you wrote!). For full credit, your project should
Exercise: even if rand() produced uniformly random numbers in [0,...,RAND_MAX] (rather than pseudorandom), I claim that for many values of k, rand() % k will not quite be uniform on the range [0,...,k-1]. See if you can figure out why, and figure out which values of k would give the uniform distribution. Lastly, try to measure how far from uniform this distribution is in terms of RAND_MAX and k.
Exercise: under the (faulty) assumption that rand() % k produces the uniform distribution on [0,...,k-1], try to prove that each permutation would be equally likely to be produced by your procedure.
[3] Note that character 31 is a non-printable control character, which is why it doesn't show up in the first output; after running it through xxd, we can see the character there in hex (it's the byte 1f).
[4] For reference, I got away with 235. The project skeleton had 120.