OCR (Optical Character Recognition) is a technique for recognizing characters from the arrangement of pixels in an image. Every image is an ordered set of pixels (picture elements), and each character in an image is likewise an ordered set of pixels. Each pixel stores a number that identifies its color. Each character (letter, digit, etc.) is therefore a particular combination of ordered pixels.
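As a rough illustration of the idea above, the following Python sketch stores one character as a small grid of pixels. The 5x5 bitmap, the names LETTER_T, render and ink_pixels, and the 0/1 color values are assumptions made for this example; real scanned characters are larger and usually grayscale or color.

```python
# A hypothetical 5x5 black-and-white bitmap of the letter "T".
# 1 = dark (ink) pixel, 0 = light (background) pixel.
LETTER_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def render(bitmap):
    """Return the bitmap as text, one row per line, '#' for ink pixels."""
    return "\n".join(
        "".join("#" if pixel else "." for pixel in row) for row in bitmap
    )


def ink_pixels(bitmap):
    """Return the (row, column) of every ink pixel, in reading order."""
    return [
        (r, c)
        for r, row in enumerate(bitmap)
        for c, pixel in enumerate(row)
        if pixel
    ]


print(render(LETTER_T))
```

The ordered list returned by ink_pixels is what distinguishes one character from another: a different letter of the same size would light up a different set of positions.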
There are two types of PDF converters:
Not OCR-enabled
OCR-enabled
All GIRDAC PDF Converters except PDF to Word Converter belong to the second category. Some PDF documents contain text inside images (scanned PDF files). GIRDAC PDF Converter Ultimate extracts such text as formatted text through its OCR (Optical Character Recognition) Layout option. The text in the image may be black and white, grayscale, or color; the extracted text is black. Accuracy depends on image quality, font, font size, and the presence of special characters and symbols. The option does not extract images or shapes from the PDF file. It currently works with English-language text.
OCR software converts handwritten or typewritten documents into machine-editable text. Earlier OCR systems were trained to recognize specific fonts. Current OCR systems can recognize most fonts with high accuracy. Some can also convert the image into a formatted document that matches the layout of the original. OCR uses algorithms to recognize characters and neural networks to increase accuracy.
OCR software employs two methods:
Matrix matching
Feature extraction
Matrix matching is the simpler of the two methods. It compares each character image with a library of character matrices. When the image matches one of these pixel matrices, the software labels it as the corresponding character.
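Matrix matching can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular product: the three-letter TEMPLATES library, the 5x5 glyph size, and the functions similarity and match_character are all assumptions made for the example, and every glyph is assumed to be already cropped and scaled to the template size.

```python
# Hypothetical template library: '#' is an ink pixel, '.' is background.
TEMPLATES = {
    "I": ["..#..", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
}


def similarity(glyph, template):
    """Fraction of pixel positions where glyph and template agree."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def match_character(glyph, templates=TEMPLATES):
    """Return (label, score) for the template that agrees with the glyph best."""
    scores = ((label, similarity(glyph, tpl)) for label, tpl in templates.items())
    return max(scores, key=lambda item: item[1])


# A slightly damaged "T": the bottom pixel of the stem is missing.
noisy_t = ["#####", "..#..", "..#..", "..#..", "....."]
label, score = match_character(noisy_t)
print(label, score)
```

Because the comparison is pixel by pixel, this approach tolerates a little noise but breaks down when the font, size, or slant differs from the stored matrices, which is why it suits a fixed set of known fonts.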
Feature extraction uses artificial intelligence to analyze features such as closed shapes, diagonal lines, and line intersections. This method is more flexible and is used for both typewritten and handwritten documents.
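To make the notion of a feature concrete, the Python sketch below measures just one of the features named above, the number of closed shapes in a character. It is a hand-written measurement under simplifying assumptions, not a learned model: the function count_closed_shapes, the 5x5 glyphs, and the '#'/'.' pixel encoding are hypothetical, and a real system would combine many such features and pass them to a trained classifier.

```python
from collections import deque


def count_closed_shapes(glyph):
    """Count background regions fully enclosed by ink ('#') pixels."""
    rows, cols = len(glyph), len(glyph[0])
    seen = [[False] * cols for _ in range(rows)]
    closed = 0
    for r in range(rows):
        for c in range(cols):
            if glyph[r][c] == "#" or seen[r][c]:
                continue
            # Flood-fill one 4-connected background region.
            touches_border = False
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_border = True
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and glyph[ny][nx] != "#"
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # A region that never reaches the edge is enclosed by ink.
            if not touches_border:
                closed += 1
    return closed


LETTER_O = [".###.", "#...#", "#...#", "#...#", ".###."]
LETTER_B = ["####.", "#...#", "####.", "#...#", "####."]
LETTER_L = ["#....", "#....", "#....", "#....", "#####"]

for name, glyph in (("O", LETTER_O), ("B", LETTER_B), ("L", LETTER_L)):
    print(name, count_closed_shapes(glyph))
```

Unlike a pixel-by-pixel comparison, this count stays the same when the character is drawn in a different font or by hand, as long as its loops remain closed, which illustrates why feature-based methods cope better with varied input.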