The auto-classification function is an AI-based classifier.  Its input is the set of tokens taken from the assigned training fields.  (Tokenization is the process of splitting raw text into words and phrases based on spaces and punctuation.)  A pre-processing step eliminates highly common terms.  All training fields are weighted equally.  The user can control which fields are used for training.

For example, suppose you set Abstract and Title as the training fields (the default setting) and set up a Record Classification containing two categories, Red and Blue.  As the user manually assigns records to the Red and Blue categories, the classifier tokenizes the Abstracts and Titles, removes tokens that are highly frequent across the entire file, and then compares the remaining tokens in the Red records with the remaining tokens in the Blue records.  It checks these against the tokens in all the records to determine a confidence level in its ability to assign records.
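
The tokenizing and common-term removal steps described above can be pictured with a minimal Python sketch. It is an illustration only, not the product's implementation: the function names, the single-word tokenizer (the real tokenizer also produces phrases), and the `max_doc_fraction` cutoff are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split raw text into lowercase word tokens on spaces and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def record_tokens(record, training_fields=("Title", "Abstract")):
    """Collect the tokens from a record's training fields, all weighted equally."""
    tokens = set()
    for field in training_fields:
        tokens.update(tokenize(record.get(field, "")))
    return tokens

def remove_common_tokens(token_sets, max_doc_fraction=0.8):
    """Drop tokens found in more than max_doc_fraction of all records.

    The 0.8 cutoff is a hypothetical value chosen for this example.
    """
    doc_freq = Counter(tok for toks in token_sets for tok in toks)
    limit = max_doc_fraction * len(token_sets)
    common = {tok for tok, n in doc_freq.items() if n > limit}
    return [toks - common for toks in token_sets], common

records = [
    {"Title": "Widget assembly method", "Abstract": "A method for assembling a widget."},
    {"Title": "Gadget control system", "Abstract": "A system and method for controlling a gadget."},
    {"Title": "Widget housing", "Abstract": "A housing and method for a widget."},
]
token_sets = [record_tokens(r) for r in records]
filtered, common = remove_common_tokens(token_sets)
print(sorted(common))  # tokens present in every record: ['a', 'for', 'method']
```

After filtering, only the tokens that can help tell records apart (such as "widget" and "gadget") remain for the Red versus Blue comparison.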

For example, if all your manually assigned Red records contain “Widget” and all your manually assigned Blue records contain “Gadget”, the algorithm begins to associate Widget with Red and Gadget with Blue.  Keep in mind, however, that the algorithm keeps score across all the tokens in each assigned record.  Some tokens will be unique to Red and help define Red, while others will be general tokens that appear in both Red and Blue.  These common tokens drive down the confidence level.  Tokens that are unique to one category improve the confidence level, as do tokens that appear proportionally more often in one category than in the others.
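
To make the score keeping more concrete, here is a small Python sketch of one way to measure how strongly a token favors one category. The actual scoring used by the classifier is not documented here, so the `distinctiveness` measure and both function names are assumptions chosen purely for illustration.

```python
from collections import Counter, defaultdict

def token_category_shares(labeled):
    """For each token, compute the fraction of each category's records that contain it."""
    cat_sizes = Counter(cat for _, cat in labeled)
    counts = defaultdict(Counter)
    for tokens, cat in labeled:
        for tok in set(tokens):
            counts[tok][cat] += 1
    return {
        tok: {cat: counts[tok][cat] / cat_sizes[cat] for cat in cat_sizes}
        for tok in counts
    }

def distinctiveness(shares):
    """Hypothetical measure: 0.0 if equally common everywhere, 1.0 if seen in one category only."""
    total = sum(shares.values())
    if not total:
        return 0.0
    return (max(shares.values()) - min(shares.values())) / total

# Manually assigned training records: (tokens, category)
labeled = [
    ({"widget", "sensor"}, "Red"),
    ({"widget", "motor"}, "Red"),
    ({"gadget", "sensor"}, "Blue"),
    ({"gadget", "sensor"}, "Blue"),
]
shares = token_category_shares(labeled)
print(distinctiveness(shares["widget"]))  # 1.0: appears only in Red
print(distinctiveness(shares["sensor"]))  # about 0.33: appears in both, more often in Blue
```

In this toy data, "widget" and "gadget" are the kind of tokens that raise confidence, while "sensor" appears in both categories and contributes much less.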

The algorithm is designed to need fewer than 100 manually assigned records to produce a usable result.  However, if the user garbles the training (puts records in the wrong category) or uses multi-valued categories (the same record can be placed in multiple categories), training can take a little longer and confidence levels will not be as high.  The system will also have issues if the user puts no records, or only a few, into a category.  For example, if your record classification has six categories but the user trains only five of them, the lack of training in the empty category will decrease the system's confidence.

The user can manually train as many records as they wish.  The system will update the classifier’s confidence dynamically.  Once the user has reached an acceptable confidence level, they can start the auto-classifier to assign the remaining records.  If the auto-classifier is unable to assign a record, it will place it in a new “Unclassified” category.
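
The fallback to “Unclassified” can be sketched as follows. How the product decides that a record cannot be assigned is not described here, so the vote counting, the `token_votes` lookup, and the `min_margin` parameter are hypothetical stand-ins.

```python
def auto_classify(tokens, token_votes, min_margin=1):
    """Pick the category with the most token votes, or 'Unclassified' if no clear winner.

    token_votes maps a token to the category it is associated with (assumed structure).
    min_margin is a hypothetical lead the best category needs over the runner-up.
    """
    scores = {}
    for tok in tokens:
        cat = token_votes.get(tok)
        if cat is not None:
            scores[cat] = scores.get(cat, 0) + 1
    if not scores:
        return "Unclassified"  # no known tokens at all
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_cat, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return best_cat if best - runner_up >= min_margin else "Unclassified"

token_votes = {"widget": "Red", "gadget": "Blue"}
print(auto_classify({"widget", "housing"}, token_votes))  # Red
print(auto_classify({"housing"}, token_votes))            # Unclassified
print(auto_classify({"widget", "gadget"}, token_votes))   # Unclassified (tie)
```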

Deciding what confidence level is acceptable is up to the user.  The confidence measure expresses the system’s belief in its ability to correctly assign any given record.  As the user trains the classifier, the AI system builds a knowledge base, and the confidence measure is an assessment of the quality of that knowledge base.

Best Practices

  • Create all your classifications and categories up front, then go into the Smart Trainer. This helps the algorithm pick records for you to classify in an attempt to populate each category early in the process.
  • Create a disregard category if you have records that you don’t want to factor into the Smart Trainer at all. Records that go into the disregard category are ignored entirely for the purposes of auto-classification.
  • When ‘Allow Multiple Selection’ is toggled on for a Classification, the Smart Trainer will treat each Category as “Yes/No.” This can be beneficial if it fits your specific scheme, but in general the Smart Trainer will perform better on discrete categories like the Red/Blue example.
  • When importing a knowledge base, the data must contain the same training fields as the dataset it was exported from, and should generally be from a similar technology space, since the Smart Trainer relies on matching terminology within the training fields.
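
The “Yes/No” treatment of multi-select classifications mentioned in the list above can be pictured as one independent True/False question per category. The sketch below is only an illustration of that idea; the function name and data layout are assumptions, not the Smart Trainer's internals.

```python
def to_yes_no_targets(record_categories, all_categories):
    """Turn multi-select assignments into one Yes/No (True/False) list per category."""
    return {
        cat: [cat in cats for cats in record_categories]
        for cat in all_categories
    }

# Each record may carry zero, one, or several categories.
record_categories = [{"Red"}, {"Red", "Blue"}, {"Blue"}, set()]
targets = to_yes_no_targets(record_categories, ["Red", "Blue"])
print(targets["Red"])   # [True, True, False, False]
print(targets["Blue"])  # [False, True, True, False]
```

With discrete categories, a record in Red is by definition not in Blue, which gives the trainer more to work with than a set of unrelated Yes/No questions.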

More FAQ

If I run the auto-classifier on one classification after it has reached an acceptable confidence level, then continue to Smart Train on the others, will it re-show me records I've only partly classified?

ANSWER: The Smart Trainer will favor showing you records that have not been classified in EITHER classification, so it may be a while before you see any that have already been classified in your original classification.

If I add new categories to a classification I've already run the auto-classifier on (which created an Unclassified group), use the Smart Trainer, and then rerun the auto-classifier, does it keep the same choices as before or reallocate based on the new additions? What happens to the unclassified records?

ANSWER: Once the auto-classifier runs, it will not second-guess itself even if you change the categories.  So, if your original categories are RED and BLUE and you run the classifier, then later add a GREEN category, the classifier will not look at a record and say “I put this in RED before, but now I think I’ll change it to GREEN”.  For that matter, the classifier won’t even change its mind about records it put in the Unclassified group.  You’ll need to delete the Unclassified category to “free up” those records to be re-classified.
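
The behavior described in this answer can be modeled with a short Python sketch. It assumes a simple dictionary of record IDs to categories and a `predict` callback; both are hypothetical and only mirror the rule that existing assignments are never revisited.

```python
def run_auto_classifier(assignments, predict):
    """Assign only records with no category yet; existing assignments are left alone."""
    for record_id, category in assignments.items():
        if category is None:
            assignments[record_id] = predict(record_id)
    return assignments

def delete_category(assignments, category):
    """Deleting a category frees its records so they can be classified again."""
    for record_id, current in assignments.items():
        if current == category:
            assignments[record_id] = None
    return assignments

# Record 1 was assigned manually; records 2 and 3 are untouched.
assignments = {1: "Red", 2: None, 3: None}

# First run: the classifier only knows RED and BLUE (toy predictions).
run_auto_classifier(assignments, lambda rid: "Blue" if rid == 2 else "Unclassified")

# A GREEN category is added and the classifier is rerun: nothing changes,
# because every record already has an assignment.
run_auto_classifier(assignments, lambda rid: "Green")
after_rerun = dict(assignments)

# Deleting Unclassified frees record 3, which can now be reassigned.
delete_category(assignments, "Unclassified")
run_auto_classifier(assignments, lambda rid: "Green")
print(after_rerun)   # {1: 'Red', 2: 'Blue', 3: 'Unclassified'}
print(assignments)   # {1: 'Red', 2: 'Blue', 3: 'Green'}
```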

If I create a sub-dataset, does the Smart Trainer reset itself or carry over the ‘training’?

ANSWER: Creating a sub-dataset will carry over the classifications, but the underlying knowledge base will rebuild itself based on the smaller dataset.  That is, the auto-classifier might make different predictions in a sub-dataset than it would in the original.  To use the knowledge base as built and then apply it to a sub-dataset, we recommend training in the main dataset, exporting the knowledge base, creating the sub-dataset, and importing the knowledge base.  Then run the auto-classifier for the imported knowledge base field.