Lab 8

This assignment is adopted from Exercise 6.6 in Algorithm Design by Kleinberg and Tardos (2006). The following introductory text is partly taken from the book.

Overview

The goal of this lab is to design a Dynamic Programming algorithm to align text as evenly to the right as possible. To do this, you will start by coming up with a greedy algorithm, and explain the cases in which the algorithm generates incorrect output. Then you will design a recursive algorithm. Finally, you will identify the redundant computations in the algorithm and use that to design a bottom up Dynamic Programming algorithm. You will implement this algorithm in HW7.

Task description

In a word processor, the goal of “pretty-printing” is to take text with a ragged right margin, like this:

Call me Ishmael.
Some years ago,
never mind how long precisely,
having little or no money in my purse,
and nothing particular to interest me on shore,
I thought I would sail about a little
and see the watery part of the world.

and turn it into text whose right margin is as “even” as possible, like this:

Call me Ishmael. Some years ago, never
mind how long precisely, having little
or no money in my purse, and nothing
particular to interest me on shore, I
thought I would sail about a little
and see the watery part of the world.

More concretely, here is the input-output mapping.

Input

List of words. For example the paragraph above would be represented as: ['Call', 'me', 'Ishmael', ..., 'of', 'the', 'world']
List of word lengths. For example, the word length list corresponding to the list of words would be [4, 2, 7, ..., 2, 3, 5]
Slack functor. This is a function f that is used to measure the slack (or “evenness”) of the line. Concretely, given a line with words i to j, slack of the line is f(L - (w_i, w_i+1, w_i+2 ... w_j)), where L is the maximum number of letters that will fit on any line.

Output

A list of line indices that correspond to the last word on each line for the optimal solution (i.e., the solution that minimizes the slack across all lines). Therefore, if there are k lines, then there should be k-1 indices.

Assumptions

A word is a contiguous sequence of non-whitespace characters, for example, Ishmael., years, or ago,.
Assume we’re doing this in the context of a fixed-width font; so, we’re just concerned with the number of characters that end up on each line and we don’t need to know how much “space on a page” that a character consumes.
In the output, we assume words on a line are separated by a single space.
We are not allowed to separate or hyphenate words.

Pre-lab

If the maximum number of characters that can fit on a line (i.e, L) is 40, calculate the overall slack (i.e., sum across all lines) for the two blocks of text above when:
- Slack functor is linear i.e., f(x) = x
- Slack functor is square i.e., f(x) = x^2
Assuming that the second block of text is the optimal solution, describe what the output would be.

Step 0: Creating a markdown file for writeups

One person in your team should visit https://demo.hedgedoc.org/new to create an empty document
Share the URL of the document created with the others in your team (e.g., email them the link in the address bar). When someone else visits the URL, they are able to collaboratively edit the document.
Each person should copy and paste the URL of the document into the Moodle submission box for the “Lab 1” assignment. Do this right away, even before writing in the document, so that your course account has a link to your work.

Step 1: Write a Greedy Algorithm

Write a greedy algorithm for this task when you assume that f(x) = x (i.e., a linear function), and then justify why this greedy algorithm is not guaranteed to work for all functions. In your writeup include:

The greedy rule.
A justification of why the greedy rule works in the linear case.
Counter examples that demonstrate that the greedy rule does not work for at least one other function.

Step 2: Design a Recursive Algorithm to find optimal slack

Formulate the problem recursively. Write down a recursive formula or algorithm for the whole problem in terms of the answers to smaller subproblems. A complete recursive formulation has two parts:

Specification Describe the problem that you want to solve recursively, in coherent and precise English—not how to solve that problem, but what problem you’re trying to solve. Without this specification, it is impossible, even in principle, to determine whether your solution is correct.
Solution Give a clear recursive formula or algorithm for the whole problem in terms of the answers to smaller instances of exactly the same problem.

Note, for this part do not worry about returning the actual list of line breaks – just write the recurrence relation to calculate the optimal slack.

Hint: If you are feeling stuck, it might help to try and find parallels with other Dyanmic Programming Algorithms we’ve covered in class, such as the weighted interval scheduling problem

Step 3: Design a Dynamic Programming Algorithm to find optimal slack

Identify the redundant operations in your recursive algorithm and use that to create a Dynamic Programming version of the algorithm. The following steps might be helpful. Note, for this part do not worry about returning the actual list of line breaks – just write an algorithm that can calculate the optimal slack.

Identify the subproblems. What are all the different ways your recursive algorithm can call itself, starting with some initial input?
Choose a memoization data structure. Find a data structure that can store the solution to every subproblem you identified in step (a). This is usually but not always an array (possibly multidimensional).
Identify dependencies. Except for the base cases, every subproblem depends on other subproblems—which ones? Draw a picture of your data structure, pick a generic element, and draw arrows from each of the other elements it depends on. This will help with the next step.
Find a good evaluation order Order the subproblems so that each one comes after the subproblems it depends on. You should consider the base cases first, then the subproblems that depends only on base cases, and so on, eventually building up to the original top-level problem.
Analyze space and running time. The number of distinct subproblems determines the space complexity of your memoized algorithm. To compute the total running time, add up the running times of all possible subproblems, assuming deeper recursive calls are already memoized.
Write down the algorithm. You know what order to consider the subproblems, and you know how to solve each subproblem. So do that! If your data structure is an array, this usually means writing a few nested for-loops around your original recurrence, and replacing the recursive calls with array look-ups.

Step 4: Modify the Dynamic Programming Algorithm to find line breaks

In the previous step your algorithm only identified the optimal slack. In this part modify the algorithm so you can keep track of the sequence of steps that got you to the optimal slack. From this sequence of steps, you should be able to recover the indices of the line breaks.

Step 5: Submit the writeup

Each group member should submit the same markdown file to Moodle. In the submission comment add a link to the hedgedoc file and also indicate who you worked with.

Step 6: Understanding the program requirements

Once you are done submitting the writeup, spend some time understanding the requirements for the Java implementation of this algorithm.

Starter files

The following starter are provided:

PrettyPrint.java contains the declaration of the function you should implement (splitWords). It also contains an example main function which reads the input sequence of words from standard input (or a file) and uses the array returned from splitWords to print the output with the specified breaks.
SlackFunctor.java contains the interface definition for SlackFunctor. (described below)
Several files ending with .dat which are used by ComputeScores as test cases.

splitWords

The input to this function consists of three parameters:

An integer array lengths of length n. length[i] represents the number of characters in the i-th word of the input. Note that these numbers don’t include any whitespace that may have been in between the original words (as that is irrelevant in our problem). For example, if the input sequence of words is:

Call me Ishmael. Some years ago, never mind how long the input array lengths = {4, 2, 8, 4, 5, 4, 5, 4, 3, 4}.

An interget L that gives the maximum length of any line in the output.
An object of type SlackFunctor which specifies what the function to be used to compute the slack is.

For example, if we want the goal to be based on the sum of the cubes of the slacks, with a maximum line length of 20 characters, then we would call splitWords as follows:

int[] lastWords = PrettyPrint.splitWords(lengths, 20,
        new SlackFunctor() {
            public double f(int slack) { return Math.pow(slack,3); }
    });

The output to this function should be a list of indices corresponding to the last word on each line, in order of the lines, in an optimal solution. So, if the optimal pretty-printing for the above sequence is determined to be

Call me Ishmael.
Some years ago,
never mind how long

then the list returned should be 2,5,92,5,9 because the last word on the first line is the third word of the input sequence (its length was at index 2 in the input, because of zero-based indexing), the last word in the second line is the sixth word of the input sequence, and so on.

If a word in the input is longer than L, then the function should return null, as no pretty-printing with that margin is possible.

Step 7 (homework): Coding, Testing, Program Submission

The Moodle page has a link to submit your code. Please submit your PrettyPrint.java file only.

You do not need to submit the SlackFunctor.java file.

Scoring

To get scores call “java ComputeScores” after compiling your source code. It will call the splitWords functions with different inputs.