Tuesday, October 25, 2016

How I Wrote my own Pre-Edit Tool in Java

Here's the manual, which describes why I set off on this quest and the features I chose.

PreEdit version 1_0_5

User Manual

copyright 2016 by Scott Rhine

This software may be used on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either expressed or implied. None of this is considered a replacement for hiring a human editor to correct your work. You can report problems or request new features by sending correspondence to ScottRhineBooks@gmail.com

1.0 Introduction

This tool analyzes fiction text to help eliminate problems before sending it to an editor. It’s based on the automated checklist I use before submitting my own work. To use it, bring up the graphical user interface (GUI), select text from the desired document (control C), and paste (control V) into the top window. Sometimes, I tweak the text by getting rid of the table of contents at the start or the notes at the end of a draft. To generate a report, click the Analyze button. The results will appear in the text window below in a few seconds. (See screenshot.)

The summary provides an overview of items like grade level and counts of elements you may wish to polish. It also lists difficult chapters, those that score significantly higher than the average grade level of the rest of the novel.

1.1 Radio Buttons

By clicking the eight radio buttons, you can include or exclude the eight subreports:
  • adverb – Lists all adverbs in your document ordered by use count. Aim for under 13.5 per thousand by fixing most of the first four on the list.
  • thats – Lists “that” occurrences that could be removed, with a count of each.
  • compound – Lists side-by-side or hyphenated words which may be combined to form a single compound word, such as pickup. Note that often the spelling choice depends on intended meaning and context (to lift vs. a vehicle), which is why human examination of each is necessary. The most common compounds such as onto, into, and upon have been filtered out, as I trust an author to recognize these differences already.
  • weak – Lists weak words and phrases which could be shortened or replaced: lift up, attempted to, begin to, grow in size, almost, and so forth.
  • phrases – Lists three-word phrases repeated throughout the document.
  • long – Lists any sentences over 55 syllables, ordered by length. Aim to eliminate all of these.
  • hard – Lists sentences of 10 or more words that average 1.75 or more syllables per word, ordered by Gunning Fog (GF) grade level.
  • repeat – Lists all words (other than common nouns which may be the subject of the paragraph) that repeat within a 30-word span. Some are intentional echoes inside dialogue, but most are legitimate mistakes and all should be checked.
Once a report is generated, you can still adjust what appears by clicking the radio buttons. Not every report is useful for everybody. I like to repeat the checks after correcting the contents of the initial report.

1.2 Printing

The results may be selected and copied to a file or email of your choice. They can also be printed using the Print button. The Print button will not activate until an Analyze has been done. Note that the default Java print function has 38 lines per page while WordPad uses closer to 60. Which one you use depends on how you like documents formatted while editing. I recommend printing the list in some fashion because shortening sentences feels easier for me with pen and paper.

1.3 About

The About button pops up a version of the introduction to give people the basics.

2.0 Motivation

I looked at a lot of tools before building this one. This is a partial list to give a feel for my frustration. These are only my opinions.

Word 365
The latest Word’s grade level seemed inaccurate and wouldn’t work until after a complete spell-check finished. Having lost the recent version I owned when a laptop crashed, I refused to pay a monthly fee for something I have a working, free 2007 version of.

The Hemingway app and other demos seem overly concerned with passive voice, although Hemingway allows up to 20 percent of the document to be written that way. If I’m under 3 percent, I don’t want to hear the complaints, especially in dialogue. People talk that way, and a program should be able to shut features off inside quotes. Their adverb count is very low and misses a lot of them. Their grade level seems to be a simple Flesch-Kincaid, though their word count is low, which makes me suspect the accuracy. Worse, the program has a bug that gets the level wrong when you paste in a text file rather than a Word file. I also wanted to see reading levels to the nearest tenth of a grade so I know how far I have to go in my smoothing efforts. They want $10 for the desktop version.

Only allows a demo of one chapter. The code is extremely slow and sometimes never finishes. It only gives one sample per person before you have to buy. But I noticed everyone has a similar Java interface.

This site will give you just the FK readability level.

Offers to send you a free sample of your output if you give them your email address. After about 40 seconds, I got rated between good and great for my adverb usage, a meaningless metric. It displayed a histogram of my adverbs by highest usage but undercounted my adverbs by a factor of two. On the flip side, they highlighted my 7 uses of “directly” in 80 thousand words as too many. They want $30 a month. I wasn’t impressed. If I joined premium, they offered a wide range of fancy features similar to https://iwl.me/ that sounded cool but had little practical use. I don’t need someone to check for clichés. If my character used one, it was intentional. I also seriously doubt a computer can tell where you need more description.

After the Deadline offers free tools for spelling and grammar. Its grammar/style suggestions were horrible and the spell-checker flagged all names and legit compound words.

I tried this a while back but didn’t like the Premium Word plug-in. The number of false positives is incredible. It was right about twice per chapter. Again, proper names, hyphens, and compound words confuse it. I would recommend checking on a per-chapter basis with their free Chrome plug-in, but don’t pay for it. A chapter takes about 8 seconds to process in Gmail.

In short, I was willing to boost my spell-checker, but everything else, I was better off writing a tool for myself. I started with my own checklist and added a few features that others asked for.

3.0 Detailed Usage Example

The best way to see how it helps is to use the tool. I’ll give detailed examples for each feature. This process is to be done after the draft is complete. To apply this kind of chainsaw to a work in progress could stunt the creative process. Also, you’d only have to repeat the process later. The numbers below are from my upcoming book “Quantum Zero Sentinel.”

3.1 Summary

Most of these metrics are the sort that Word prints when you do a Tools/Word Count on the pull-down menu. I made sure these values matched Word’s very closely. For now, my tool ignores numerals. I played with a lot more metrics and statistics, but they turned out to be less useful.

The summary gives you a fair idea where your weak points are.  I included average words per chapter and the length/name of the biggest chapter. If the maximum is over twice the average, consider splitting it if there’s a natural scene break. The part-of-speech density goals appear beside the metric and are calibrated from the last twenty books I wrote. The Gunning Fog goal between 7 and 8 is a well-established range.
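The density figures and the chapter-length check in the summary are plain ratios. Here is a minimal Java sketch of that arithmetic, using the counts from the sample summary below. The class and method names are hypothetical and are not taken from PreEdit's source.

```java
// Illustrative sketch only; names are hypothetical, not PreEdit's actual code.
final class SummaryMetrics {
    /** Occurrences per thousand words, as in the adverb line of the summary. */
    static double perThousandWords(long count, long words) {
        return 1000.0 * count / words;
    }

    /** Occurrences per hundred sentences, as in the "thats" line of the summary. */
    static double perHundredSentences(long count, long sentences) {
        return 100.0 * count / sentences;
    }

    /** True when the longest chapter is more than twice the average chapter length. */
    static boolean considerSplitting(long longestChapterWords, long averageChapterWords) {
        return longestChapterWords > 2 * averageChapterWords;
    }

    public static void main(String[] args) {
        // Counts copied from the sample summary: 1301 adverbs in 82180 words,
        // 325 thats in 8183 sentences, longest chapter 2138 words, average 1227.
        System.out.println(perThousandWords(1301, 82180));   // roughly 15.83
        System.out.println(perHundredSentences(325, 8183));  // roughly 3.97
        System.out.println(considerSplitting(2138, 1227));   // 2138 is under 2 * 1227 = 2454
    }
}
```

With those inputs the adverb density lands above the 13.5 goal and the longest chapter stays under the twice-the-average mark.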

syllables 118209
words 82180
sentences 8183
paragraphs 3179
chapters 67
average words per chapter 1227
longest chapter 2138       29. The Hanging Judge
adverbs 1301 per thousand words 15.83    (goal <= 13.5)
thats 325 per hundred sentences 3.97     (goal <= 3)
weak phrases 377
Flesch-Kincaid grade level 4.9
Gunning Fog grade level    7.8    (goal 7 - 8)

Difficult Chapters
   9.2   Prologue–Sales Pitch
   9.7   5. Adventures in Babysitting
   10.3   12. Business as Usual
   9.9   14. Memory of a Lifetime
   8.9   25. Disruptions
   9.1   36. Old Money
   9.3   38. Snowball Fight
   9.0   39. My Weird Meter Goes to Eleven
   8.8   47. Industrial Pollution
   9.0   64. Hero’s Welcome
Flesch-Kincaid is the best metric for readability. The FK Grade Level test gives the approximate US grade school level of the document. A score of 8.0 means that an eighth grader can understand the document. The formula is:
(.39 x ASL) + (11.8 x ASW) – 15.59
Where ASL is average sentence length in words and ASW means average syllables per word. Counting syllables can be difficult, as English is a language full of special cases. For example, “Wicked” could be two if it means evil, or one if it refers to liquid soaking through fabric or string. There are weak points in this model. Every second grader knows the word “vocabulary” (5 syllables) but not the one syllable “pith.”
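As a worked illustration of the formula above, here is a small Java sketch that computes the grade from raw counts. It assumes the syllable, word, and sentence totals have already been counted; the names are hypothetical, and the tool's own counting rules (for example, ignoring numerals) may give slightly different totals.

```java
// Illustrative sketch only; names are hypothetical, not PreEdit's actual code.
final class FleschKincaid {
    /** Grade level from raw counts: 0.39 * ASL + 11.8 * ASW - 15.59. */
    static double gradeLevel(long syllables, long words, long sentences) {
        if (words <= 0 || sentences <= 0) {
            throw new IllegalArgumentException("need at least one word and one sentence");
        }
        double asl = (double) words / sentences;  // average sentence length in words
        double asw = (double) syllables / words;  // average syllables per word
        return 0.39 * asl + 11.8 * asw - 15.59;
    }

    public static void main(String[] args) {
        // 100 words in 10 sentences with 150 syllables: ASL = 10, ASW = 1.5,
        // so the grade is 3.9 + 17.7 - 15.59 = 6.01.
        System.out.println(gradeLevel(150, 100, 10));
    }
}
```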

To balance this, I also give the Gunning Fog version of grade level. Average sentence length is the same; however, it has a much simpler syllable method, calling a word “hard” if it’s 3 syllables or more. The value PHW is percent hard words.
.4 * (ASL + PHW)
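
The Gunning Fog formula can be sketched the same way in Java. This assumes the caller has already counted the "hard" words (three syllables or more); the names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical, not PreEdit's actual code.
final class GunningFog {
    /** Grade level: 0.4 * (ASL + PHW), where PHW is the percentage of hard words. */
    static double gradeLevel(long words, long sentences, long hardWords) {
        if (words <= 0 || sentences <= 0) {
            throw new IllegalArgumentException("need at least one word and one sentence");
        }
        double asl = (double) words / sentences;  // average sentence length in words
        double phw = 100.0 * hardWords / words;   // percent of words with 3+ syllables
        return 0.4 * (asl + phw);
    }

    public static void main(String[] args) {
        // 100 words in 10 sentences with 8 hard words: ASL = 10, PHW = 8,
        // so the grade is 0.4 * 18 = 7.2.
        System.out.println(gradeLevel(100, 10, 8));
    }
}
```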

Notice that GF is usually higher than FK. The truth is probably somewhere in between.
Why does this work? Think of someone who has a name over three syllables: Virginia, Jonathan, or Gregory. Unless they’re the king, does anybody use that full name? No. People are inherently lazy. They give a nickname like Ginny, Jon, and Greg. What else triggers this filter? Adverbs like “unfortunately” or five-dollar words like “ascertain.” The majority of young readers still subvocalize, and they will tend to skip over these words. However, even the most thorough word-choice comb would lower the score by less than one grade. The best approach is to work on your outliers, chapters that score more than one level above the average. These problems are more structural in nature and usually indicate a boring chapter.
  1. Too much techno-babble that needs to be thinned.
  2. Lack of dialogue.
  3. Lack of action, which is normally conveyed in short, direct sentences.
  4. Too small a sample. The model breaks down under 500 words.
The list of outliers reveals something critical: the prologue. The first thing the reader sees shouldn’t be on the hard list! The first three chapters should be the simplest and slickest.
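
Picking out the outlier chapters is a simple filter once each chapter has a grade. The Java sketch below assumes per-chapter grades are already computed and uses a one-grade margin as described above; the names are hypothetical and the tool's exact comparison may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not PreEdit's actual code.
final class DifficultChapters {
    /** Returns titles of chapters whose grade exceeds the overall grade by more than margin. */
    static List<String> outliers(Map<String, Double> gradeByChapter, double overallGrade, double margin) {
        List<String> hard = new ArrayList<>();
        for (Map.Entry<String, Double> chapter : gradeByChapter.entrySet()) {
            if (chapter.getValue() > overallGrade + margin) {
                hard.add(chapter.getKey());
            }
        }
        return hard;
    }
}
```

Passing a LinkedHashMap keeps the chapters in book order, which matches how the report lists them.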

After a pass of aeration and trimming, it looked like this—a vast improvement. Oddly, lowering the overall average made new chapters pop up. These are close enough to one grade level from the average to stop.

Flesch-Kincaid grade level 4.7
Gunning Fog grade level    7.5    (goal 7 - 8)

Difficult Chapters
   8.6   5. Adventures in Babysitting
   8.6   14. Memory of a Lifetime
   8.6   20. Miss Direction
   8.6   21. Board Meeting
   8.5   35. The All-Seeing Eye

3.2 Adverbs

This section was the most embarrassing. You can tell what words you’re addicted to in a hurry. I only print the words that are over 5 percent of our goal line. Who cares about the ones you’ve only used a few times? The reader won’t even notice them. Adverbs are extremely easy to whittle down by addressing the worst abuses first. Once you see how many you’ve used, guilt will help you to be merciless.
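
A popularity list like the one below can be sketched in Java as a frequency count with a cutoff. Two things here are assumptions: the adverb set is supplied by the caller (how PreEdit recognizes adverbs is not described), and the cutoff is one reading of "over 5 percent of our goal line", namely 5 percent of the number of adverbs the goal would allow for the document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and the cutoff rule are assumptions.
final class AdverbReport {
    /** Assumed cutoff: a fraction of the adverbs allowed by the per-thousand goal. */
    static double cutoff(long totalWords, double goalPerThousand, double fraction) {
        return fraction * goalPerThousand * totalWords / 1000.0;
    }

    /** Returns "count word" lines for adverbs used more than cutoff times, most frequent first. */
    static List<String> popular(List<String> words, Set<String> adverbs, double cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            if (adverbs.contains(key)) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.removeIf(row -> row.getValue() <= cutoff);
        // Highest count first; ties broken alphabetically.
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> row : rows) {
            lines.add(row.getValue() + " " + row.getKey());
        }
        return lines;
    }
}
```

For an 80,000-word draft and a goal of 13.5 per thousand, this reading puts the cutoff at 54 uses.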

adverbs 1301 per thousand words 15.83    (goal <= 13.5)
Adverbs by Popularity
176 only
140 just
89 now
68 already
68 here
66 never
54 well

adverbs 952 per thousand words 11.89     (goal <= 13.5)
Adverbs by Popularity
108 only
68 just
54 now

Since it is now below the desired threshold, I would turn off the radio button for this subreport.

3.3 Thats

One can’t simply go on a rampage against all uses of “that.” There are too many and most of them are necessary. This tool helps you find specific instances that are likely to be fluff.
Often it takes the form of a “feeling-verb” that or “noun that pronoun verb.”
thats 325 per hundred sentences 3.97     (goal <= 3)
Removable Instances of That
18 that the
14 that she
8 know that
7 that he
6 that they
6 that we
5 that a
5 that her
4 fact that
4 so that
4 that I
4 that it
3 afraid that
3 realize that
3 sure that
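
The pair counts in a report like the one above could be gathered by looking at the word on either side of each “that.” The Java sketch below does this with two small word lists that are hypothetical, loosely modeled on the sample output; PreEdit's real lists are not given in this manual. A single “that” can be counted under both its leading and its following word.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only; the word lists and names are assumptions.
final class ThatReport {
    static final Set<String> LEADERS = Set.of("know", "fact", "so", "afraid", "realize", "sure");
    static final Set<String> FOLLOWERS = Set.of("the", "a", "she", "he", "they", "we", "i", "it", "her");

    /** Counts "leader that" and "that follower" pairs in a tokenized text. */
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < words.size(); i++) {
            if (!words.get(i).equalsIgnoreCase("that")) {
                continue;
            }
            if (i > 0) {
                String before = words.get(i - 1).toLowerCase(Locale.ROOT);
                if (LEADERS.contains(before)) {
                    counts.merge(before + " that", 1, Integer::sum);
                }
            }
            if (i + 1 < words.size()) {
                String after = words.get(i + 1).toLowerCase(Locale.ROOT);
                if (FOLLOWERS.contains(after)) {
                    counts.merge("that " + after, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```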

Arthur Quiller-Couch said to “murder your darlings.” That’s too vague. This specific list allows me to target the dark minions hiding among us. To quote the movie They Live, “I have come here to chew bubblegum and kick ass. And I’m all out of bubblegum.”
thats 204 per hundred sentences 2.48     (goal <= 3)
Removable Instances of That
5 that she
5 that the
3 that a
3 that he
3 that it
3 that we

3.4 Compound Words

One of my biggest flaws is not recognizing simple, two-syllable compound words while I’m writing. I always wanted some magic button in Word to check for possible compounds between nearby words. Now I have that magic wand. I assembled about 3000 candidates from my last twenty novels and added a few from various English-for-beginners websites. Check them in Webster’s yourself. The one flaw in this feature is that they often depend on the part of speech. Nouns are the most frequently combined. The adjective would often be hyphenated, while a verb is kept separate. For example, “I pick up my wife in a pickup, and she responds with a pick-up line.” Note that these are only right about a third of the time, but there were so many that I feel I have been saved a great deal of embarrassment.
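
The check described above amounts to joining neighboring or hyphenated words and looking the result up in a candidate list. Here is a minimal Java sketch under that assumption; the names are hypothetical and the candidate set stands in for the roughly 3000-word list mentioned above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative sketch only; names are hypothetical, not PreEdit's actual code.
final class CompoundFinder {
    /** Counts side-by-side or hyphenated word pairs that join into a known compound. */
    static Map<String, Integer> find(List<String> words, Set<String> candidates) {
        Map<String, Integer> hits = new TreeMap<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i).toLowerCase(Locale.ROOT);
            // Hyphenated form, e.g. "pick-up".
            if (word.indexOf('-') >= 0) {
                String joined = word.replace("-", "");
                if (candidates.contains(joined)) {
                    hits.merge(joined, 1, Integer::sum);
                }
            }
            // Side-by-side form, e.g. "pick" followed by "up".
            if (i + 1 < words.size()) {
                String joined = word + words.get(i + 1).toLowerCase(Locale.ROOT);
                if (candidates.contains(joined)) {
                    hits.merge(joined, 1, Integer::sum);
                }
            }
        }
        return hits;
    }
}
```

Run on the example sentence, this flags both “pick up” and “pick-up” as possible “pickup” and leaves the already-joined word alone. Deciding which spelling is right still needs a human.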

Possible Compound Words
6 pickup
4 comeback
4 lookup
4 makeup
3 shutdown
2 holdup
2 longtime
2 sometime
2 takedown
1 backdoor
1 backseat
1 backup
1 barstool
1 birthplace
1 blackballed
1 blackmail
1 boardroom
1 bookends
1 buildup
1 carwash
1 chokepoint
1 darkroom
1 doorframe
1 drugstore
1 ductwork
1 facedown
1 fireproof
1 flagpole
1 flatlined

3.5 Weak Words and Phrases

This section is probably the most subjective. People have asked to add their own personal favorites to the list, but that’s difficult to do while keeping the product install simple. However, I think the results speak for themselves. This is one of the longest reports, with no lower threshold for the number of occurrences, but you can sharpen your story as much or as little as you want. The word “about” happens in so many contexts, I had to restrict it to specific cases, like vague numbers.
weak phrases 377
Weak Phrases
130 know
22 sort of
21 area
18 this is a
16 all of
12 rather
10 at all
9 and then
9 get to
8 which is
7 find out
7 individual
7 quite
7 somehow
5 kind of
5 show up
5 try to
5 about one

weak phrases 278
Weak Phrases
121 know
10 rather
9 find out
9 this is a
8 area
7 and then
7 which is
6 has to be
6 kind of
6 quite
5 have to be
5 somehow
5 sort of

3.6 Repeated Phrases

When I read Edgar Rice Burroughs, I marveled at how much he repeated himself over the span of a book. “I found in my opponent no mean swordsman.” As for me, I tend to repeat dialog tags. Use of any given phrase isn’t wrong, but repeat it ten times and it will rankle someone. Often the crutch phrases are weak or inaccurate. I only check outside dialog for clusters of three words in common. To scale it, I make the threshold based on the total number of triples in the document, with a minimum of two.
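
Counting three-word clusters with a scaled threshold might look like the Java sketch below. It assumes the dialogue has already been stripped from the word list, and the scaling divisor (`triplesPerRepeat`) is a made-up parameter, since the manual does not give the actual ratio.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; names and the scaling parameter are assumptions.
final class RepeatedPhrases {
    /** Returns three-word phrases that occur at least max(2, triples / triplesPerRepeat) times. */
    static Map<String, Integer> repeated(List<String> words, int triplesPerRepeat) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + 2 < words.size(); i++) {
            String phrase = (words.get(i) + " " + words.get(i + 1) + " " + words.get(i + 2))
                    .toLowerCase(Locale.ROOT);
            counts.merge(phrase, 1, Integer::sum);
        }
        int triples = Math.max(0, words.size() - 2);
        int threshold = Math.max(2, triples / triplesPerRepeat);  // never below two
        counts.values().removeIf(count -> count < threshold);
        return counts;
    }
}
```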
Globally Repeated Phrases
18 shook his head
16 in front of
13 shook her head
11 be able to
11 for a moment
11 raised an eyebrow
11 side of the
10 reminded her of
10 the rest of

I cringed every time I recognized the truth of one of these. Afterward, the document didn’t trigger the threshold for this report.

3.7 Long Sentences

How long is too long? When I narrated my first audiobook, I found out exactly how long. If you can’t say it in a single breath in a single take without stumbling or passing out, fix it. Why? Because many people still read aloud under their breath. It’s a hard habit to break. You don’t want your customers falling over. I set the bar at 55 syllables. Beyond that level is too complex and blows your grade level score, too. The only time I allow myself to exceed this limit is when I have a rare semicolon-separated list. In this case, I could take a breath during a pause at the semicolons. Check out the sentences below to see if you agree.
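
To show how a syllable limit could be enforced, here is a Java sketch with a deliberately crude syllable estimate: it counts vowel groups and drops a silent final “e.” That heuristic is an assumption for illustration, not PreEdit's counting method, and it will miscount plenty of English special cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; the syllable heuristic is a rough stand-in.
final class SentenceLength {
    /** Rough estimate: vowel groups, minus a silent trailing "e" (but not "-le"). */
    static int syllables(String word) {
        String w = word.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        if (w.isEmpty()) {
            return 0;
        }
        int count = 0;
        boolean previousWasVowel = false;
        for (char c : w.toCharArray()) {
            boolean vowel = "aeiouy".indexOf(c) >= 0;
            if (vowel && !previousWasVowel) {
                count++;
            }
            previousWasVowel = vowel;
        }
        if (w.endsWith("e") && !w.endsWith("le") && count > 1) {
            count--;
        }
        return Math.max(1, count);
    }

    static int sentenceSyllables(String sentence) {
        int total = 0;
        for (String word : sentence.trim().split("\\s+")) {
            total += syllables(word);
        }
        return total;
    }

    /** Returns "count sentence" lines for sentences over the limit, longest first. */
    static List<String> longSentences(List<String> sentences, int limit) {
        List<String> flagged = new ArrayList<>();
        for (String sentence : sentences) {
            if (sentenceSyllables(sentence) > limit) {
                flagged.add(sentence);
            }
        }
        flagged.sort((a, b) -> Integer.compare(sentenceSyllables(b), sentenceSyllables(a)));
        List<String> lines = new ArrayList<>();
        for (String sentence : flagged) {
            lines.add(sentenceSyllables(sentence) + " " + sentence);
        }
        return lines;
    }
}
```

Calling `longSentences(sentences, 55)` would mirror the limit described above.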

Longest Sentences by Syllable Count
66 All of our automated factory lines utilize third generations to perform quality control inspections as well as assemble the computers with molecular layering techniques and microwave-laser precision no human being could duplicate
64 Given the driver’s ethnicity type of vehicle a certain radio station bumper sticker and distance from owner’s home I could design a program which would guarantee that a sizable fraction of traffic stops find drugs or a gun in the glove box
61 Provided a female human comes from a rare subgroup of black-haired black-eyed people with more than 5 percent Denisovan genes one in thirty-six fertilizations from a male Denisovan could result in a viable non-sterile offspring
58 “Because the memory log will record the Aeon signature of every user and victim together with a timestamp similar to the history feature in police Tasers” Maia ordered in her best older-sister voice
57 In this state a nonviolent felony of that class might earn Antreou a year of prison plus a five hundred dollar fine but with proper representation and the lack of hard evidence he’d probably end up with probation
55 The gun would lock onto whatever image you locked into the crosshairs and continued to fire every time that image drift back into the scope—even if the soldier carrying died or became incapacitated

Again, once you scrub the offenders, this report no longer appears. It can take a few passes for some of them.

3.8 Difficult Sentences

This is a tricky area. I seldom agree with the opinions highlighted by other software. Why? Because syllable-based methods are statistical and break down below about five hundred words. For example, the sentence “Esmerelda wept” was an astronomical 28th grade level because the percentage of hard words was over half. If a sentence is below ten words, I assume that someone won’t have problems reading it. I also established a minimum average syllables per word through trial and error. Then the program reports anything over 20th grade level. I haven’t had false positives since, but use your own judgment.
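
The three filters described above can be sketched in Java as follows. The sketch assumes each word's syllable count is already known and treats a single sentence as its own Gunning Fog sample; the names are hypothetical, and the thresholds are passed in as parameters.

```java
// Illustrative sketch only; names are hypothetical, not PreEdit's actual code.
final class DifficultSentence {
    /** Gunning Fog grade for one sentence: 0.4 * (word count + percent hard words). */
    static double gunningFog(int[] syllablesPerWord) {
        int words = syllablesPerWord.length;
        int hard = 0;
        for (int s : syllablesPerWord) {
            if (s >= 3) {
                hard++;
            }
        }
        return 0.4 * (words + 100.0 * hard / words);
    }

    /** Applies the minimum word count, minimum average syllables, and grade cutoff in turn. */
    static boolean isDifficult(int[] syllablesPerWord, int minWords, double minAvgSyllables, double minGrade) {
        int words = syllablesPerWord.length;
        if (words < minWords) {
            return false;  // short sentences are assumed readable
        }
        int total = 0;
        for (int s : syllablesPerWord) {
            total += s;
        }
        if ((double) total / words < minAvgSyllables) {
            return false;
        }
        return gunningFog(syllablesPerWord) > minGrade;
    }

    public static void main(String[] args) {
        // Hand-counted syllables for "Relieved Esperanza chatted with gentlemen
        // from the Colombian Agriculture Department": 10 words, 5 of them hard.
        int[] esperanza = {2, 4, 2, 1, 3, 1, 1, 4, 4, 3};
        System.out.println(gunningFog(esperanza));                 // 0.4 * (10 + 50) = 24
        System.out.println(isDifficult(esperanza, 10, 1.75, 20));  // passes all three filters
    }
}
```

With my hand-counted syllables, that sentence scores 24, which agrees with its entry in the report below.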

Difficult Sentences
25 Unfortunately this meant inviting Doctor Antreou to attend because he was the designated medical supervisor
24 Relieved Esperanza chatted with gentlemen from the Colombian Agriculture Department
24 Your official salaried position will be secretary in charge of Legacy press releases
23 Engineers need natural gas and several explosive expensive poisons to make the same fertilizer
23 Pharaohs are particularly difficult because the family tree goes straight up
23 You can apply this mechanism to designer drugs encryption or even predicting particle interaction in a supercollider
22 Lifted from the radio project’s interface a couple of monitors displayed vital statistics for six individuals identified by their initials
22 Mom taught violin at the prestigious Juilliard Academy and frequently performed at Lincoln Center in both classical and electronic genres
21 “Because of the unique microscopic broadcast methods the quantum resonance frequency is customized to each person

In a few cases I could use a shorter nickname for someone, but usually I decided I was trying to stuff too much into one sentence.

3.9 Locally Repeated Words

My pet peeve when reading popular YA fiction is how under-edited it can be. I cringe every time someone “suggests a suggestion” or admires the “long length of a leg.” To avoid such jangling prose, the code checks each word against the previous thirty. I tried as high as fifty but it reported far too many. I think we can agree that using a rare word within three sentences or a paragraph can trigger a reader’s gag reflex, especially on the same line. I thought I was immune to this, but I did it all the time! I was horrified how juvenile it sounded. The biggest trick was weeding out false positives. Clearly, we can already filter adverbs and thats. Any word of four characters or fewer can easily go undetected. Short, sarcastic echoes in dialog should also be ignored. Lastly, nouns that are the subject of the paragraph should be filtered. This took the longest to train. I left in weak words like “started”, “wanted”, “needed”, or “going” because they should be pulled like weeds. Gerunds are tricky. I left them for the reader to decide because one might be used in verb form like “computing” below. I also matched different forms of a word like “managed by a manager.” This had the interesting side effect of triggering on rhymes as well. I removed one instance of “clicking” and “licking” to avoid sounding like an Edgar Allan Poe poem.
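
The core of the window scan can be sketched in Java as below. This version only catches exact repeats; matching different forms of a word, skipping dialogue, and filtering paragraph subjects are all left out. The names, the minimum-length parameter, and the ignore set are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and parameters are assumptions.
final class LocalRepeats {
    /** Returns the positions of words that repeat an earlier word within the given window. */
    static List<Integer> repeats(List<String> words, int window, int minLength, Set<String> ignored) {
        List<Integer> hits = new ArrayList<>();
        Map<String, Integer> lastSeen = new HashMap<>();  // word -> most recent position
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i).toLowerCase(Locale.ROOT);
            if (word.length() < minLength || ignored.contains(word)) {
                continue;  // short or filtered words are never reported
            }
            Integer previous = lastSeen.get(word);
            if (previous != null && i - previous <= window) {
                hits.add(i);
            }
            lastSeen.put(word, i);
        }
        return hits;
    }
}
```

A call such as `repeats(words, 30, 5, ignored)` corresponds to the thirty-word span described above, and the returned positions are the words that would get a star in the report.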

Locally Repeated Words (marked with *)
    Instead of turning toward the Manhattan ferry he steered the opposite direction toward* a sea of warehouses
    Gravel spit as he spun the vehicle toward* the Holland Tunnel
    As Maia rode through town she passed the boarded up Wal Mart a reminder of how rough* times were and how much she needed this job
    How did you feel when Beijing started* dumping US debt
    It’s going to hurt for a while but we’re all going* to survive
    The archaeology department needed* a tech to run their magnetosonic probe equipment for a dig site this summer
    They draft plans manage* budgets and oversee all the contractors to bring their vision to life
    You can’t possibly judge* her work qualifications from trivia and pretty eyes that bat their lashes at you
    Parallel computing* is like harnessing a hundred bunnies for the same task
    I can’t thank you enough* Greta
    We need to suppress this until we can have a plan to deal with the dislocation it’s going* to cause
    Who could afford* not to

There are still a few false matches left to ignore, but the ugly ones it finds are well worth the effort.

4.0 Conclusion

I thoroughly enjoyed writing and using this tool on my latest story. I even went back to “The K2 Virus” to trim the repeated words and fix compounds. I felt much better about the quality of my writing when I was finished, and I think any novelist would benefit. I decided to give it away for beta testing to people during NaNoWriMo this November and perhaps charge $5 a copy after that. Since I don’t have a Mac and they don’t support executable jar files, I can only support it on Windows 8 and above.

