How OCR Works in DocuShare Go

Applicable To

DocuShare Go version 26.1 and later.

Overview

DocuShare Go automatically processes uploaded files through an OCR pipeline to make content searchable.

Three OCR modes are available: Basic OCR Only, Threshold-Based, and Always Use Content Assistant OCR.

The selected mode determines whether existing text is trusted, how aggressively the AI engine is applied, and how many Content Assistant pages are consumed from the organization data pack.

OCR Modes

An administrator selects one mode for the entire organization on the Manage Subscription page.

Basic OCR Only (default)

If the file already contains a text layer, that text is indexed and OCR is skipped.

If no text layer exists, Basic OCR runs on the raw images. It handles typed text and clean scans well, but it does not handle handwriting or poor-quality scans.

- Content Assistant pages consumed: None.

Threshold-Based (recommended)

Same as Basic OCR Only, with one addition: if Basic OCR confidence falls below the threshold, the file is automatically re-processed with Content Assistant OCR.

- Content Assistant pages consumed: Pages are drawn from the data pack only for documents where Basic OCR confidence falls below the threshold.

- No charge for documents where Basic OCR is sufficient.

NOTE: The Confidence Threshold slider (default 95%) is on the Manage Subscription page. It applies to Threshold-Based mode only.

Always Use Content Assistant OCR

Any existing text layer is ignored. Every file is sent directly to Content Assistant OCR on the raw page images.

This mode gives the highest-quality result, including on handwriting. Pre-existing upstream OCR text is intentionally overridden.

- Content Assistant pages consumed: Pages from the data pack on every document, without exception.

- Falls back to Basic OCR if the data pack is empty.
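The three modes differ mainly in when Content Assistant pages are drawn from the data pack. The following is a minimal sketch of that difference, not DocuShare Go code. The `Upload` type, its field names, and the function name are hypothetical, and the sketch assumes one data-pack page is consumed per document page and that the data pack is not empty; neither assumption is confirmed by this article.

```python
from dataclasses import dataclass
from enum import Enum


class OcrMode(Enum):
    BASIC_ONLY = "Basic OCR Only"
    THRESHOLD = "Threshold-Based"
    ALWAYS_CA = "Always Use Content Assistant OCR"


@dataclass
class Upload:
    """Hypothetical description of one uploaded file."""
    pages: int
    has_text_layer: bool
    basic_confidence: float  # assumed Basic OCR score for the file, 0-100


def pages_consumed(doc: Upload, mode: OcrMode, threshold: float = 95.0) -> int:
    """Estimate Content Assistant pages used for one upload.

    Assumption: one data-pack page per document page.
    """
    if mode is OcrMode.BASIC_ONLY:
        return 0
    if mode is OcrMode.ALWAYS_CA:
        return doc.pages  # every document, text layer or not
    # Threshold-Based: only files without a text layer whose
    # Basic OCR confidence falls below the threshold.
    if doc.has_text_layer or doc.basic_confidence >= threshold:
        return 0
    return doc.pages


batch = [
    Upload(pages=10, has_text_layer=True, basic_confidence=0.0),   # exported PDF
    Upload(pages=4, has_text_layer=False, basic_confidence=98.0),  # clean scan
    Upload(pages=2, has_text_layer=False, basic_confidence=61.0),  # handwritten form
]
totals = {m.value: sum(pages_consumed(d, m) for d in batch) for m in OcrMode}
print(totals)
```

For this sample batch the estimate is 0 pages in Basic OCR Only, 2 in Threshold-Based, and 16 in Always Use Content Assistant OCR, which illustrates why Threshold-Based is the cheaper option for mixed workloads.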

How the OCR Pipeline Works

Basic OCR Only and Threshold-Based Modes

Both modes check for an existing text layer before running OCR. The pipeline has four stages:

1. Does the file already contain a text layer?

Word, Excel, PowerPoint, and text files always have a text layer. PDFs have a text layer if digitally exported or pre-processed by an upstream system (EasyDoc, MFP, Adobe Acrobat, etc.). When existing text is found, it is indexed and OCR is skipped.

2. No text found - run Basic OCR.

Basic OCR reads printed text and produces a confidence score. It does not handle handwriting or poor-quality scans. In Basic OCR Only mode, this is the last OCR stage; the result goes directly to indexing (stage 4).

3. Confidence too low - escalate to Content Assistant OCR. (Threshold-Based mode only.)

Content Assistant OCR handles handwriting, mixed forms, and poor-quality scans significantly better than Basic OCR. Content Assistant pages from the data pack are consumed per document. If the data pack is empty, the Basic OCR result is used instead.

4. Index the resulting text for search.
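The four stages above can be sketched as a single decision function. This is an illustrative outline only: the function name, the mode strings "basic" and "threshold", and the parameters are hypothetical, and the sketch assumes Basic OCR yields one confidence score per file on a 0-100 scale.

```python
def indexed_text_source(
    mode: str,
    has_text_layer: bool,
    basic_confidence: float,
    threshold: float = 95.0,
    pack_is_empty: bool = False,
) -> str:
    """Return which text ends up in the search index (stage 4)."""
    if mode not in ("basic", "threshold"):
        raise ValueError("this sketch covers only the two text-layer-aware modes")

    # Stage 1: existing text is trusted and OCR is skipped.
    if has_text_layer:
        return "existing text layer"

    # Stage 2: Basic OCR runs and produces basic_confidence.
    # Stage 3: Threshold-Based mode escalates when confidence is too low,
    # unless the data pack is empty, in which case the Basic OCR result is kept.
    if mode == "threshold" and basic_confidence < threshold and not pack_is_empty:
        return "Content Assistant OCR"

    return "Basic OCR"


print(indexed_text_source("basic", True, 0.0))        # existing text layer
print(indexed_text_source("basic", False, 40.0))      # Basic OCR
print(indexed_text_source("threshold", False, 72.0))  # Content Assistant OCR
print(indexed_text_source("threshold", False, 72.0, pack_is_empty=True))  # Basic OCR
```

Note that a file with a text layer never reaches stage 2 in either mode, regardless of how good or bad that text is.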

Always Use Content Assistant OCR Mode

Stage 1 is skipped entirely. DocuShare Go ignores any existing text layer in the file and sends every document directly to Content Assistant OCR on the raw page images.

- Highest-quality result on every document, including handwriting.

- Content Assistant pages from the data pack consumed on every document, without exception.

- Falls back to Basic OCR if the data pack is empty.

- Pre-existing OCR text from upstream tools is intentionally ignored.

Frequently Asked Questions

Q1. If I change OCR settings, will existing documents be re-processed?

No. The new mode applies only to files uploaded after the change is saved. Reprocessing all documents on every settings change would be slow, expensive in Content Assistant pages, and could replace good text with lower-quality text.

NOTE: To reprocess a specific document, upload it again as a new version.

Q2. Can DocuShare Go use OCR text from an upstream system (EasyDoc, MFP, Adobe Acrobat)?

Behavior depends on the selected mode.

- Basic OCR Only / Threshold-Based: Existing text layers are read and indexed directly. No configuration required. DocuShare Go trusts the text it finds. If upstream OCR quality is poor, that poor text is indexed as-is.

- Always Use Content Assistant OCR: All existing text layers are ignored. Use this mode when upstream OCR quality is poor or inconsistent.

NOTE: Document metadata fields (title, author, custom properties) are separate from the OCR pipeline.

Q3. A document is not appearing in search results. Why?

Review these causes in order of frequency:

- Cause 1: Upstream OCR produced poor indexed text (most common for handwriting). Example: an MFP embeds garbled text and DocuShare Go indexes it. The document is in the index, but it matches only garbled words that no user would search for.

Fix - either of the following works:

(a) Switch the organization to Always Use Content Assistant OCR. DocuShare Go will then ignore the upstream text layer and re-OCR every file with Content Assistant. Remember this only affects new uploads (see Q1); the existing document must be re-uploaded as a new version.

(b) Turn off in-device OCR on the MFP (or other upstream OCR solution) before scanning handwritten material, then re-upload the document. DocuShare Go will then see a raw image and run its own OCR pipeline.

- Cause 2: OCR mode cannot handle this document type. Basic OCR Only cannot handle handwriting. Switch to Threshold-Based or Always Use Content Assistant OCR, then re-upload affected documents.

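- Cause 3: Content Assistant data pack is exhausted. When the data pack is empty, DocuShare Go falls back to Basic OCR, which does not handle handwriting or poor-quality scans. Check the data pack balance on Manage Subscription.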

- Cause 4: Document is still being processed. Most documents are searchable within a few minutes. If not searchable after 30 minutes and other causes are ruled out, open a ticket with support to investigate.
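The checklist above can be expressed as a small triage helper that returns the causes worth checking, in the same order. This is a hedged sketch for support staff, not a DocuShare Go feature: the function, its parameters, and the mode strings "basic", "threshold", and "always" are all hypothetical.

```python
def possible_causes(
    mode: str,
    has_upstream_text_layer: bool,
    has_handwriting: bool,
    pack_is_empty: bool,
    minutes_since_upload: int,
) -> list[str]:
    """List the Q3 causes that fit the observed facts, in Q3 order."""
    causes = []
    # Cause 1: existing text is trusted in every mode except "always".
    if has_upstream_text_layer and mode != "always":
        causes.append("Cause 1: poor upstream OCR text was indexed as-is")
    # Cause 2: Basic OCR ran on handwriting it cannot read.
    if mode == "basic" and has_handwriting and not has_upstream_text_layer:
        causes.append("Cause 2: Basic OCR Only cannot handle handwriting")
    # Cause 3: Content Assistant OCR was wanted but the data pack is empty.
    if mode != "basic" and pack_is_empty:
        causes.append("Cause 3: Content Assistant data pack is exhausted")
    # Cause 4: the 30-minute processing window has not elapsed yet.
    if minutes_since_upload < 30:
        causes.append("Cause 4: document may still be processing")
    return causes


for cause in possible_causes("threshold", True, True, True, 5):
    print(cause)
```

An empty result after 30 minutes corresponds to the case where the other causes are ruled out and a support ticket is the next step.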

Guidance for Common Scenarios

Your organization has handwriting requirements

Use Threshold-Based OCR (explicitly recommended in the Release 26.1 notes). It handles mixed workloads efficiently: fast on typed and digital-source documents, accurate on handwriting, and only consumes Content Assistant pages when Basic OCR is insufficient.

Your scanner or MFP pre-OCRs files before upload

In Basic OCR Only and Threshold-Based modes, DocuShare Go reads and indexes the text layer your upstream device produced. If your documents include handwriting or the MFP OCR quality is poor, you have two options:

1. Switch to Always Use Content Assistant OCR. DocuShare Go will ignore the upstream text layer and re-OCR every file with Content Assistant.

2. Disable in-device OCR on your MFP for handwritten material before scanning. DocuShare Go will then receive the raw image and apply its own OCR pipeline.

You are changing your organization's OCR mode

The new mode applies only to documents uploaded after the change is saved. Existing documents are not reprocessed. For high-priority existing documents, re-upload them as a new version to apply the new settings.

Search is not finding a specific document

Work through the causes listed in Q3 in order. Cause 1 - where an upstream system produced poor-quality OCR text that DocuShare Go indexed as-is - is the most common and most counterintuitive. The document is technically in the search index; it just cannot be found because the indexed text does not match what anyone would type into the search box.
