FRIDAY, MARCH 25, 2016
Eric Lee, A-SOCIATED PRESS
TOPICS: DIAGRAMMATIC LANGUAGE, FROM THE WIRES, LITERACY, UNIVERSAL LANGUAGE
TUCSON (A-P) — Clean and Straighten Pages of Semantography (Blissymbolics) Second Edition, 1965, by C. K. Bliss.
My first attempt took several weeks of full-time work and made the scanned pages readable. The cleaned-up JPG files, put into a PDF file, produced a readable ebook. There was a learning curve along the way, so results varied. I was impatient to get through the 882 pages and overlooked imperfections. The result was good enough to read, but fell short of what was needed.
All pages need to be straightened and artifacts removed. Not just to make a PDF with better looking pages, but to then OCR each page to separate text from graphics. Bliss and his wife made many manual corrections, crossing out words, squeezing new ones in, so any OCR generated page will have to be manually corrected. The end result will be a PDF version or other ebook format that is searchable. Each page could also be saved as HTML and the entire book put online with a comment field at the bottom of each page. Bliss scholars and students could thereby share notes. The online version could also be searched. In editable form, a writer might rework the material to come up with Semantography Abridged to make Bliss' rough draft work more accessible to the less persistent. C.K. Bliss did what he could and if someone had done what Bertrand Russell suggested and put up the money needed to publish the work to perform "an important service to mankind," someone would have done a lot of editing and rewriting. Could still be done.
I have redone the first 49 pages to come up with how-to instructions. Originally I used Photoshop, but as not everyone can afford or pirate a copy, I figured out how to use GIMP2, open source and free, to do what needs to be done. It turns out that aside from being free, GIMP2 does a much better job. Download it. There may be software that would just 'do it', but GIMP2 is needed to realign parts with pages, and having a clean version of the original is valuable. Feeding software a clean version will help it. Feeding the raw images in and getting OCR out may be tempting, but a step-by-step collaborative human/machine effort will get the job done right and well. With clean copies, a 30-day trial version of Adobe Acrobat Pro might finish the job. There appear to be numerous free OCR options.
Here is the routine: GIMP2 comes up with three windows. Move the two small ones to the side, with the main one in the available space between. The first menu at top left is 'File'; click it, then 'Open', and select the first page to convert. When it opens, enlarge the main window if needed, then hold SHIFT and press the + key 3 times to zoom in on the page. On the vertical tool bar at left, select the 'Fuzzy select tool', fourth over. At the bottom is a 'Threshold' bar. Set it to 30 to 60 depending on how dark the shading is. If the shading is not noticeable, start at 30. Click on an area that looks white. If you zoom in further with more SHIFT + presses, you will see that what should be white space is pixelated. When you click on a white area, all the area up to dark text or drawings is selected. The selected white area should cover the text area, with the dark shadow area in the margin, if any, not selected. If there is text in the unselected area, do CTRL Z to undo and increase the threshold value. Hold CTRL and press X to cut the area. If one edge is still shaded, click on the edge of the shaded area to select and cut it, repeating as needed. Zoom out if need be to see the area outside the graphic area and click on it. This makes the dotted lines around all the text go away. There may be spots in areas that should be white. Select the 'Rectangle select tool', the first on the menu, and drag to put a rectangle around the spots. Do CTRL X to cut them out.
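For anyone curious what the fuzzy select and cut steps are doing, here is a minimal sketch in plain Python. It is not GIMP's code, only an illustration of the idea under some assumptions: the page is a small grid of gray values (0 is black, 255 is white), a "click" is a seed cell, and the function names fuzzy_select and cut are made up for this example. The selection spreads from the seed to touching cells whose gray value is within the threshold of the seed's value.

```python
from collections import deque

WHITE = 255

def fuzzy_select(pixels, seed, threshold):
    """Return the set of (row, col) cells connected to `seed` whose gray
    value differs from the seed's value by at most `threshold`."""
    rows, cols = len(pixels), len(pixels[0])
    seed_value = pixels[seed[0]][seed[1]]
    selected = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Look at the four neighbors that share an edge with this cell.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in selected
                    and abs(pixels[nr][nc] - seed_value) <= threshold):
                selected.add((nr, nc))
                queue.append((nr, nc))
    return selected

def cut(pixels, selection):
    """Blank every selected cell to pure white (stand-in for CTRL X)."""
    for r, c in selection:
        pixels[r][c] = WHITE

# Toy page: off-white paper, one dark "letter" (20), shaded right margin.
page = [
    [250, 245, 240, 170],
    [248,  20, 235, 165],
    [251, 243, 238, 160],
]

# First cut: click on a white-looking area with the threshold at 30.
selection = fuzzy_select(page, seed=(0, 0), threshold=30)
cut(page, selection)

# Second cut: the margin is still shaded, so click on it and cut again.
edge = fuzzy_select(page, seed=(0, 3), threshold=30)
cut(page, edge)
```

In the toy page the dark cell survives both cuts because it is far outside the threshold, which is the same reason text survives on a real page. Raising the threshold too far would start eating into lighter strokes, which matches the missing letters seen on heavily shaded pages.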
On pages with heavy shading, individual letters may need editing to add to overcut parts or select/cut shading from within them. Sometimes one can draw in parts of letters that are missing and use the eraser. If several letters are unreadable, find those letters elsewhere and copy (in GIMP after selecting and CTRL C, click somewhere to deselect area and actually copy it), then paste. On "bad" pages, extensive areas may need reconstruction. Relatively clean pages may take less than 2 minutes to do while "bad" pages can easily take 20 minutes or more.
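To get a feel for the size of the job, the per-page times above can be multiplied out over the 882 pages. This is only back-of-envelope arithmetic: the two figures are the rough "less than 2 minutes" and "20 minutes or more" mentioned above, so the real total falls somewhere in a wide range and depends on how many pages are "bad". The names below are made up for the example.

```python
TOTAL_PAGES = 882          # page count given for the book
CLEAN_PAGE_MINUTES = 2     # rough time for a relatively clean page
BAD_PAGE_MINUTES = 20      # rough time for a "bad" page (can be more)

def total_hours(pages, minutes_per_page):
    """Hours needed if every page took the same number of minutes."""
    return pages * minutes_per_page / 60

all_clean = total_hours(TOTAL_PAGES, CLEAN_PAGE_MINUTES)
all_bad = total_hours(TOTAL_PAGES, BAD_PAGE_MINUTES)
print(f"If every page were clean: about {all_clean:.1f} hours")
print(f"If every page were bad:   about {all_bad:.1f} hours")
```

Neither extreme is realistic, but the spread shows why splitting the pages among several people makes sense.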
The text may be crooked. The 15th icon on the menu is the 'Rotate tool'; select it, then click anywhere on the image. A box appears. Rotate using a positive angle to rotate clockwise, or a negative angle to go the other way. Use the scroll bar to put a line of text close to the top or bottom of the window. Rotate until it looks straight and click the rotate button to do the rotation, which may take a few seconds. If not perfect, click anywhere again and realign.
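Rotating by eye works fine, but if you would rather have a starting number, the angle can be worked out from two points on the same line of text. This is a sketch, not anything GIMP provides: the function name is made up, and it assumes you read off the pixel positions of the left and right ends of a baseline yourself, with y increasing downward as it does in image coordinates. The sign follows the rule above that a positive angle rotates clockwise.

```python
import math

def rotation_angle_degrees(left, right):
    """Angle to type into the rotate box to level a line of text.

    `left` and `right` are (x, y) pixel positions on one text baseline,
    with y increasing downward. Positive result means rotate clockwise.
    """
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    # A line that drops toward the right (dy > 0) is tilted clockwise,
    # so the correction is the same size in the opposite direction.
    tilt_clockwise = math.degrees(math.atan2(dy, dx))
    return -tilt_clockwise

# Text that drops 10 pixels over a 1000 pixel run needs a small
# counterclockwise (negative) rotation.
angle = rotation_angle_degrees((0, 0), (1000, 10))
print(round(angle, 2))
```

Treat the result as a first guess, then check a line of text against the window edge and nudge as needed.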
Bliss often cut parts and glued them on crooked. Once the text is straight, the header or another part may still be off. Select the 'Rectangle select' tool again and put a box around the part that needs to be straightened. Do CTRL X to cut it out, then CTRL V to paste it back as a separate layer. Select the 'Rotate tool' and click on the part needing to be straightened. The dialog box comes up; rotate, then give the result a final look over. If some areas are transparent without a white background, ignore them, as they will become white when the image is exported.
To save, you need to 'export' it as a JPG. Click 'File', and just above 'Export as' is an 'Overwrite ...' entry with the file name of the page being edited. Click it and set 'Quality' to 100% to keep the white area pure white without being pixelated. The first time, click 'Save defaults' so all will be 100%. Export, then close the main window and choose 'Discard changes'.
Done. Go to 'File', load another page, enlarge main window, zoom in, and so on. C.K. Bliss would appreciate it; it's what friends are for. He'd do it, but he's dead.
To see what you might be in for and to practice, compare the first 49 originals to the cleaned ones. See what you need to do to make them reasonably clean and straight.
Semantography01-49_cleaned.zip
Semantography50-59_cleaned.zip
Semantography60-69_cleaned.zip
I hate to have all the fun, so here are some to do:
Semantography01-49.zip
Semantography50-59.zip
Semantography60-69.zip
Semantography70-79.zip
Semantography80-89.zip
Semantography90-99.zip
Semantography100-199.zip
More can be offered, but let's see if any True FoBliss are out there. If you decide to do some, comment below to avoid duplication of effort.
A few images show the issues:
The original page.
First cut: Threshold at 30, clicked on a white area, did CTRL X to cut.
Second cut, clicked on remaining gray area, may need to try different areas of the gray from lighter to darker.
Final cut, the shading is gone but so are the first two or three letters of words.
Easy part done. I had to scroll over to the right to read and determine what the words on the left were. Then I found the same two or three letters elsewhere, copied, pasted, and moved them into position. I also manually drew in missing parts of some letters. Nothing perfect, just readable at normal size without much to distract the reader. Note that in the word "attempted," Bliss, or probably his proofreader wife, overtyped the "p" to correct it. This is common. Often whole words are overtyped to insert them. OCR will mess up. Text will have to be proofed and manually fixed. The "E" in "Esperantists" should have had the vertical part drawn in, but I didn't notice at the time, and it may not be worth reloading the page and fixing. Making the image readable with minimal distractions is good enough.