Our expectation: when we read the corrected text, we expect to have at least the same experience as reading the original PDF (if not better). Even if you fall short of that, leaving the text in a significantly better state than before is valuable.
- Top level: Perfect text and formatting.
- Next level: Perfect text, with basic formatting (described below). The reader won’t feel any particular urge to consult the source most of the time.
- Next level: Almost perfect text (possibly missing diacritics and accents), with basic formatting (contiguous paragraphs, footnotes etc.).
- And so on.
Typing correct symbols
- Please use the correct symbols. Common mistakes: |(pipe) instead of ।(daNDa), :(colon) instead of visarga(ः).
If you cannot type unusual Unicode characters, copy and paste them from the list below.
- ā Ā ī Ī ū Ū ṛ Ṛ ṝ Ṝ ḷ Ḷ ḹ Ḹ
- ṃ Ṃ ḥ Ḥ
- ṅ Ṅ ñ Ñ
- ṭ Ṭ ḍ Ḍ
- ś Ś ṣ Ṣ
- ē ō r̥ r̥̄ l̥ l̥̄ ṁ
No harm using ISO instead of IAST - we can fix it later.
No harm ignoring initial letter capitalization (i.e. ṣ instead of Ṣ and so on).
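The common mistakes listed above (pipe for daNDa, colon for visarga) can be caught semi-automatically. The following is a minimal sketch, not an official tool of this project: the function name `fix_symbols` is hypothetical, and the rule "only replace punctuation that directly follows a Devanagari character" is an assumed heuristic. It will also change a genuine colon that follows Devanagari text, so please review the result before committing.

```python
import re

# Any character in the Devanagari Unicode block (U+0900 to U+097F).
DEVANAGARI = r"[\u0900-\u097F]"

VISARGA = "\u0903"       # ः
DANDA = "\u0964"         # ।
DOUBLE_DANDA = "\u0965"  # ॥


def fix_symbols(text):
    """Replace look-alike ASCII punctuation that follows Devanagari text.

    Heuristic sketch: lines that start with a pipe are assumed to be
    Markdown table rows and are left untouched.
    """
    fixed_lines = []
    for line in text.split("\n"):
        if line.lstrip().startswith("|"):
            fixed_lines.append(line)
            continue
        # Colon glued to a Devanagari character -> visarga.
        line = re.sub(rf"(?<={DEVANAGARI}):", VISARGA, line)
        # Double pipe after Devanagari text -> double danda.
        line = re.sub(rf"(?<={DEVANAGARI})(\s*)\|\|", rf"\1{DOUBLE_DANDA}", line)
        # Single pipe after Devanagari text -> danda.
        line = re.sub(rf"(?<={DEVANAGARI})(\s*)\|", rf"\1{DANDA}", line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)
```

Text in Latin script is not touched, so transliterated passages still need a manual check.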
- General reference: MD Guide.
- Italics - _italics_. Bold - **bold**.
- Headings - ## Top heading, ### Subheading, #### Subsubheading, each followed by its content.
- Lists -
  - item 1
    - item 1.1
  - item 2
    - item 2.1
    - item 2.2
      - item 2.2.1
- Paragraphs are separated by empty lines. Please remove any spaces at the beginning of lines.
- As far as possible, prefer paragraphs without any line breaks (“Enter” keystrokes). Just use word-wrap in your editor program.
- Please don’t use backticks (`). Use only ‘ or “.
- Ensure that quotes match (for example: ‘the wife of the king, the man of Devadatta’ or “the wife of the king, the man of Devadatta”).
- Please make sure that the quotes are appropriately positioned - for example,
- this is bad:
'similar, '_ûnartha_, 'words, '_kalaha_ 'quarrel,'
- this is better:
'similar,' _ûnartha_, 'words,' _kalaha_ 'quarrel,'
- and this is best:
'similar', _ûnartha_, 'words', _kalaha_ 'quarrel',.
- We’re NOT OK with the “bad” punctuation, but ok with “better” and prefer “best” above.
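Some of the formatting rules above (no leading spaces, no backticks, matching quotes) are mechanical enough to check with a script. Here is a rough sketch, assuming a hypothetical helper named `lint_markdown`; it only reports suspicious lines and changes nothing. Single quotes are deliberately not checked, because ’ also serves as an apostrophe, and indented footnote paragraphs will show up as false positives for the leading-space check.

```python
OPEN_DOUBLE = "\u201c"   # “
CLOSE_DOUBLE = "\u201d"  # ”


def lint_markdown(text):
    """Return (line_number, message) pairs for lines that may break the rules."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        stripped = line.lstrip(" \t")
        # Nested list items are legitimately indented, so skip them.
        is_list_item = stripped.startswith(("- ", "* ", "+ "))
        if stripped and stripped != line and not is_list_item:
            problems.append((number, "leading whitespace"))
        if "`" in line:
            problems.append((number, "backtick found"))
        # With one paragraph per line, double quotes should pair up per line.
        if line.count(OPEN_DOUBLE) != line.count(CLOSE_DOUBLE):
            problems.append((number, "unmatched double quotes"))
    return problems
```

Calling `lint_markdown(open(path, encoding="utf-8").read())` on a file gives a list you can walk through by hand; `path` here stands for whichever file you are correcting.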
Footnotes that appear at the bottom of the page in a physical book often appear without any separation in raw OCR text on screen. This confuses the reader, so they should be properly formatted.
Consider the footnote in the image below (right click and open it in a new window for clearer view):
Here is how it should be presented in the markdown file:
- Footnote numbers have been formatted specially -
- Footnote definitions can be of two styles. Indenting is important in the second style, which can accommodate multiple paragraphs. (MD Guide.)
- We may choose to break paragraphs, but not sentences, so as to define footnotes near their place of use. It is ok to place footnotes at the nearest logical place, for example at the end of the paragraph or list.
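A quick way to verify footnotes after editing is to check that every footnote reference has a definition and vice versa. The sketch below assumes the `[^label]` reference and `[^label]: text` definition syntax from the Markdown Guide; if the files use a different footnote style, the patterns need adjusting. The name `check_footnotes` is hypothetical.

```python
import re

# A reference such as [^1] that is not immediately followed by a colon.
REFERENCE = re.compile(r"\[\^([^\]\s]+)\](?!:)")
# A definition such as "[^1]: text" at the start of a line.
DEFINITION = re.compile(r"^\[\^([^\]\s]+)\]:", re.MULTILINE)


def check_footnotes(text):
    """Return (labels used but never defined, labels defined but never used)."""
    used = set(REFERENCE.findall(text))
    defined = set(DEFINITION.findall(text))
    return sorted(used - defined), sorted(defined - used)
```

Two empty lists mean that references and definitions line up; it does not check that each definition sits near its place of use.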
Tables and charts
- Please generate tables using this online tool.
- Where ditto marks or similar devices stand for text repeated from another entry in a list (example here), just repeat the text.
- In other cases, or if anything is unclear, please contact us with a link to the page with the table/chart/figure. Don’t hesitate to ask.
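If you prefer a script to the online tool, a pipe table can also be produced with a few lines of code. This is only an illustrative sketch: the function names, the set of ditto markers, and the sample data are assumptions, and every row is assumed to have as many cells as the header. It also applies the "just repeat the text" rule by copying the cell above wherever a ditto mark appears.

```python
# Markers treated as "same as above"; extend this set as needed (assumption).
DITTO_MARKS = {'"', "\u201d", "\u3003", "do."}


def expand_dittos(rows):
    """Replace ditto marks with the text of the cell directly above."""
    expanded = []
    for row in rows:
        new_row = []
        for column, cell in enumerate(row):
            if cell.strip() in DITTO_MARKS and expanded:
                cell = expanded[-1][column]
            new_row.append(cell)
        expanded.append(new_row)
    return expanded


def to_markdown_table(header, rows):
    """Build a Markdown pipe table from a header and a list of rows."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in expand_dittos(rows):
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

For example, `to_markdown_table(["Item", "Source"], [["A", "Book 1"], ["B", "\u3003"]])` repeats "Book 1" in the second row instead of keeping the ditto mark.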
… यथाऽऽहुः —

> (२) सुगतो यदि धर्मज्ञः कपिलो नेति का प्रमा।
> तावुभौ यदि धर्मज्ञौ मतभेदः कथं तयोः ॥

इति ।
Things to ignore
- Quotation mark placement that is not ‘bad’ as described in the examples above, i.e. don’t spend time trying to make it ‘best’.
- Extra spaces within lines. Don’t spend time correcting them.
- Please fork the repo, edit your files and send pull requests; a browser suffices (I can guide you over Google Meet).
- You should have received an email invite to join https://trello.com/ocrcorrection (we use Trello for tracking tasks). Please accept it, and contact me if you haven’t received one.
- Everyone needs a GitHub account (we use GitHub for checking and accepting corrections - example here) and must join https://github.com/orgs/vishvAsa/teams/ocr-correction/ . Please create an account and let me know offline so that I can send you an email invite to the team.
Besides the above, (unless you are already computer-savvy and have other preferences) please set up your Windows computer (if you use another OS, let me know) in the following way, so that your work can be most enjoyable and seamless.
- Git for Windows
- Atom editor
- Install Hugo (to verify your work on your computer)
- Option 1
- Get a file like hugo_0.xx.x_Windows-64bit.zip from Hugo releases
- Extract the zip file contents to some place like C:\Hugo\bin.
- Add it to your path - Start button » System » Advanced System Settings on the left » Environment Variables… button on the bottom » User variables section » Path » Double click » Click the New… button » Type in the folder where hugo.exe was extracted
- Alternatively see here.
- Option 1