AnalyzeAssist Help
Segmentation

AnalyzeAssist can segment* the following file types. It will skip numbers by default; see Segmentation Rules for details on how to change the segmentation rules. See Configuring File Extensions for instructions on changing file associations for AnalyzeAssist.

* Segment: To separate a file into segments, which are translation units generally equivalent to sentences.
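
As a rough illustration of what segmenting means, the sketch below splits text into sentence-like units and drops number-only units. It is not AnalyzeAssist's actual algorithm: the boundary rule, the definition of a "number", and the names `segment` and `skip_numbers` are all assumptions made for this example.

```python
import re

# Assumed boundary rule: sentence-final punctuation followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# Assumed definition of "a number": only digits, separators, and signs.
_NUMBER_ONLY = re.compile(r"^[\d\s.,:/%+-]+$")


def segment(text, skip_numbers=True):
    """Split text into sentence-like segments (hypothetical helper)."""
    segments = []
    for line in text.splitlines():
        for piece in _BOUNDARY.split(line.strip()):
            piece = piece.strip()
            if not piece:
                continue
            if skip_numbers and _NUMBER_ONLY.match(piece):
                continue  # number-only segments are skipped by default
            segments.append(piece)
    return segments
```

Calling `segment(text, skip_numbers=False)` keeps the number-only units, which mirrors the idea of changing the default rule.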

Text Files
Supported encodings include:
  • UTF-8
  • UTF-16
  • UTF-16 (big-endian)
Although AnalyzeAssist can detect SJIS and other multibyte encodings, for best results save your text files in one of the Unicode encodings above, with a byte-order mark (BOM). (Programs like Notepad automatically add a BOM when saving a text file in a Unicode format.)
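
To show why a BOM makes encoding detection reliable, here is a minimal sketch that picks a decoder from the first bytes of a file. It only illustrates the general technique; the function name `decode_text` and the UTF-8 fallback are assumptions, not a description of how AnalyzeAssist detects encodings.

```python
import codecs

# BOMs for the three encodings listed above, paired with a Python codec
# that consumes the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_text(data, fallback="utf-8"):
    """Decode raw bytes using the BOM if present (hypothetical helper)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data.decode(encoding)
    # No BOM: the encoding has to be guessed. This sketch simply
    # assumes UTF-8, which is where detection becomes unreliable.
    return data.decode(fallback)
```

With a BOM the decision is a simple prefix check; without one, a tool has to fall back on heuristics, which is why files saved with a BOM give the most predictable results.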
Microsoft Word Files
AnalyzeAssist will also extract text from text boxes.
Microsoft Excel Files
AnalyzeAssist will also extract text from shapes on each worksheet.
Microsoft PowerPoint Files
AnalyzeAssist will also extract text from MS Word/Excel objects embedded in PowerPoint slides, although results are not guaranteed for text boxes or shapes nested inside those objects.
HTML Files
AnalyzeAssist extracts the text displayed in the browser, including the document title and the "alt" and "title" attributes of links and images.
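
The kind of extraction described here can be sketched with Python's standard `html.parser` module. This is an illustrative approximation, not AnalyzeAssist's implementation: the class name `VisibleTextExtractor`, the decision to ignore `script` and `style` content, and the restriction of "alt"/"title" handling to `a` and `img` elements are assumptions.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect displayed text plus alt/title attribute values (sketch)."""

    _SKIP = {"script", "style"}   # assumed: content that is never displayed
    _ATTRS = ("alt", "title")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1
            return
        if tag in ("a", "img"):
            for name, value in attrs:
                if name in self._ATTRS and value:
                    self.chunks.append(value.strip())

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_html_text(markup):
    """Return the extracted text chunks in document order."""
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.chunks
```

The document title arrives as ordinary character data inside the `title` element, so it is collected without any special handling.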
XML Files
AnalyzeAssist extracts the text data from the XML nodes.
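
As a minimal sketch of pulling text data out of XML nodes, the example below uses Python's standard `xml.etree.ElementTree`. It assumes that only element text is wanted and that attribute values are ignored; AnalyzeAssist's real behavior may differ, and `extract_xml_text` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def extract_xml_text(xml_source):
    """Return the non-empty text of all nodes, in document order (sketch)."""
    root = ET.fromstring(xml_source)
    # itertext() walks element text and tail text in document order.
    return [text.strip() for text in root.itertext() if text.strip()]
```

For example, `extract_xml_text("<doc><title>Report</title></doc>")` yields a one-item list containing the title text.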

See Also