The existing code removes various control characters from the OCR text layer output by djvutxt, including \035 (ASCII 0x1D, "GS", group separator) and \037 (ASCII 0x1F, "US", unit separator). (\036 (ASCII 0x1E, "RS", record separator) is not used by DjVuLibre, as far as I can tell.)

These characters are the markers djvutxt uses to signal a paragraph break (or other OCR page area break), so removing them leads to consecutive paragraphs or regions of text being run together instead of separated by a blank line.

The practical consequence is that proofreaders on the Wikisources have to identify every paragraph break by hand: find it visually in the scanned page image, locate the equivalent point in the wikitext, and insert an extra line break. Multiply this by 5–10 paragraphs per page, for several hundred pages per book, and even after just a few hundred books the amount of wasted manual effort for volunteers becomes staggering (and English Wikisource alone currently hosts around a million proofread pages).

The code should therefore be replaced with something like…

But this raises the question: did this code actually do anything to begin with? It was added by ThomasV in this commit in 2009, along with a call to iconv() seemingly intended to nuke invalid UTF-8 characters (which probably stopped working soon after due to a long-standing PHP bug). (Note that \013 is not actually a carriage return, as the code comment implies; CR is \015.)

This is what the djvutxt output actually looks like:

    "Pe OW Bot) OLR RT SLAM OOOR \n\037\013The seizure by the pirate of so consider- \nable a person as that of the Queen of Kish- \nmoor, and of the enormous treasure that \nhe found aboard her ship, would alone have \nbeen sufficient to have established his fame. \nBut the capture of so extraordinary a prize \nas that of the ruby - which was, in itself, \nworth the value of an entire Oriental king- \ndom-exalted him at once to the very high- \nest pinnacle of renown. \n\037Having achieved the capture of this in- \ncredible prize, our captain scuttled the great \nship and left her to sink with all on board. \nThree Lascars of the crew alone escaped \nto bear the news of this tremendous disaster \nto an astounded world. \n\037As may readily be supposed, it was now \nno longer possible for Captain Keitt to hope \nto live in such comparative obscurity as he \nhad before enjoyed. His was now too re- \nmarkable a figure in the eyes of the world. \nSeveral expeditions from various parts were \nimmediately fitted out against him, and it \npresently became no longer compatible with \n\037\013his safety to remain thus clearly outlined \n \n\037\013" )

The separators arrive as textual escape sequences (a backslash followed by three octal digits), not as raw control characters. The correct fix is to look for the escape sequences output by djvutxt rather than the raw characters:

    # Replace runs of OCR region separators with a single extra line break
    $txt = preg_replace( "/(\\\\(013|035|037))+/", "\n", $txt );

Since \013 (ASCII 0x0B, "VT", vertical tab) is being used as a column separator (the other two are for regions and paragraphs), we can treat all three the same.

Ironically, it looks like this problem came this close to being discovered in 2013, when a similar problem was fixed in closely related code (pageTextCallback() is invoked 10 lines later).

In any case… Provided I'm not just being a dummy again, I'm inclined to conclude that the original code was never functional (but did no harm), and thus my original proposed change did nothing either.
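The difference between matching raw control characters and matching the escape sequences that djvutxt prints can be shown with a small standalone sketch. It is written in Python rather than the PHP used in MediaWiki, purely for illustration; the function names `strip_raw` and `mark_breaks` are hypothetical, and the sample string is a shortened fragment of the output quoted above.

```python
import re

# Fragment as djvutxt prints it: the separators are literal text
# (backslash followed by three octal digits), not raw control bytes.
escaped = r"all on board. \n\037Having achieved the capture"

# In the spirit of the old code: matches the raw VT, GS and US characters.
RAW_SEPARATORS = re.compile("[\x0b\x1d\x1f]")

# Matches the textual escape sequences instead; a run of them is
# treated as one separator.
ESCAPED_SEPARATORS = re.compile(r"(?:\\(?:013|035|037))+")


def strip_raw(text: str) -> str:
    """Remove raw separator characters (a no-op on escaped output)."""
    return RAW_SEPARATORS.sub("", text)


def mark_breaks(text: str) -> str:
    """Replace each run of escaped separators with one line break."""
    return ESCAPED_SEPARATORS.sub("\n", text)


if __name__ == "__main__":
    # The raw-character pattern finds nothing to remove in escaped text.
    print(strip_raw(escaped) == escaped)
    # The escape-sequence pattern turns the separator into a line break.
    print(repr(mark_breaks(escaped)))
```

The sketch only handles the three separators; the literal `\n` sequences in the output are left alone here, on the assumption that they are unescaped elsewhere (as pageTextCallback() appears to do in the real code).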