Classify Documents in FileCloud using Smart Classification

This is the last post in the series “Smart Classification, Metadata and Smart DLP – the powerful combo” about data classification and security in FileCloud.

In previous posts, we explained the concepts and capabilities of FileCloud’s Metadata and Data Leak Prevention (DLP) subsystems. Metadata allows users to describe the data in their system by attaching an extensive set of information to it. DLP, in turn, prevents actions based on the assigned metadata. In real-life, enterprise-level scenarios, this combination is very common; the problem starts when the volume and complexity of data increase and assigning metadata manually is no longer feasible. This is where Smart Classification becomes essential.

Smart Classification

Smart Classification is a powerful and flexible subsystem that allows organizations to define custom rules. Documents in FileCloud can be classified and metadata attributes applied based on that classification. This process is automated, and each newly uploaded file is checked against all available, active rules. The content of that file is evaluated against the provided set of rules and the corresponding metadata is applied based on the result of that evaluation.

Smart Classification is also referred to as the Content Classification Engine (CCE); both names are used interchangeably in this article. CCE automates, streamlines, and strengthens data leak prevention for an organization. Administrators and users can upload files and folders knowing that the uploaded information will be classified automatically according to its content; this helps ensure that sensitive data is immediately covered by the criteria outlined in the DLP plan. CCE rules are also applied retroactively to data that was uploaded before the rules were created, helping organizations protect legacy data.

Conceptually, this is a very simple process, yet a lot happens behind the scenes. Let’s dive deeper into the current possibilities in FileCloud, version 20.3.

When an admin visits the Smart Classification screen, they will see a detailed list of all defined CCE rules.

The CCE rule definition is flexible, allowing admins to shape the use case to their individual needs by specifying classifiers, event triggers, preconditions, conditions and match/no match actions.

Event triggers

Specifies the event upon which the rule is automatically executed. The FILEINDEXED event, which occurs when a document is successfully indexed in SOLR, is the most common across use cases. In FileCloud version 20.3, only ICAP DLP classification uses a different trigger: it runs at file upload rather than at SOLR indexing.

Auto-classification

Specifies whether an uploaded file should be automatically classified or not. If auto-classification is disabled, admins can always run a ‘catch-up’ classification on-demand from the admin panel.

Definition

This is the heart of each CCE rule, specifying all details regarding when classification should happen, which classifier should be used, what condition is treated as a match and which metadata values should be set depending on the rule evaluation result.

Rule Definition

Each definition contains the following sections:

Classifier

Specifies which classifier is used to classify the file content. Currently, the following classifiers are supported:

  • Default – classifies content into terms that match the supplied regex patterns. Supported parameters: SEARCH_PATTERN_SET, SEARCH_PATTERN_NAME or SEARCH_PATTERN_GROUP

Result Schema (_classifications): [{term: "term that matched a regex", count: "number of times the term appears in the doc"}, …] – specifies how many times each term has been matched.

  • PatternMatch – classifies content into regex patterns found. Supported parameters: SEARCH_PATTERN_SET, SEARCH_PATTERN_NAME or SEARCH_PATTERN_GROUP

Result Schema (_classifications): ["regex pattern", …] – specifies which patterns have been matched at least once.

  • StandardQuery – classifies content as matching the query or not.
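To make the difference between the Default and PatternMatch result schemas concrete, here is a minimal Python sketch. It is not FileCloud's implementation; the function names and the sample patterns are assumptions chosen only to reproduce the two result shapes described above.

```python
import re
from collections import Counter

def classify_default(text, patterns):
    """Default-style result: one entry per distinct matched term, with its count."""
    counts = Counter()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            counts[match.group(0)] += 1
    return [{"term": term, "count": count} for term, count in counts.items()]

def classify_pattern_match(text, patterns):
    """PatternMatch-style result: the patterns that matched at least once."""
    return [pattern for pattern in patterns if re.search(pattern, text)]

# Hypothetical sample content and pattern set.
sample = "IDs: 123456789, 987654321, 123456789"
patterns = [r"[0-9]{9}", r"[A-Z]{2}[0-9]{6}"]

default_result = classify_default(sample, patterns)
pattern_result = classify_pattern_match(sample, patterns)
print(default_result)
print(pattern_result)
```

The Default-style result answers "which terms, and how often", while the PatternMatch-style result only answers "which patterns were seen", which is why a condition such as count(_classifications) counts different things depending on the classifier.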

Pre-condition

Specifies rules that must be met by a file before it is evaluated through the classifier. Some sample pre-conditions:

  • Classify only .txt files uploaded to team folders

"precondition": "starts_with(_file.fullPath, '/teamfolders/') && _file.ext == 'txt'"

  • Classify files less than 1000 bytes

"precondition": "_file.size < 1000"

  • Classify files with one of the specified extensions

"precondition": "_file.ext in ['txt', 'pdf', 'csv']"

  • Classify all files

"precondition": "true"
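The sample pre-conditions above can be read as simple predicates over file attributes. The following Python sketch mirrors them for illustration only; the dictionary layout, the starts_with helper, and the sample file are assumptions, and FileCloud evaluates its own expression syntax rather than Python.

```python
def starts_with(value, prefix):
    """Python stand-in for the starts_with() function used in the expressions."""
    return value.startswith(prefix)

# Each entry mirrors one sample pre-condition from the list above.
PRECONDITIONS = {
    "txt files in team folders": lambda f: starts_with(f["fullPath"], "/teamfolders/") and f["ext"] == "txt",
    "files under 1000 bytes": lambda f: f["size"] < 1000,
    "allowed extensions": lambda f: f["ext"] in ["txt", "pdf", "csv"],
    "all files": lambda f: True,
}

# Hypothetical file description standing in for the _file object.
file_info = {"fullPath": "/teamfolders/hr/notes.txt", "ext": "txt", "size": 2048}

results = {name: check(file_info) for name, check in PRECONDITIONS.items()}
for name, passed in results.items():
    print(f"{name}: {passed}")
```

Only files for which the pre-condition evaluates to true are handed to the classifier; the 2048-byte sample file passes every check except the size limit.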

Condition

Specifies the criteria that decide whether the match action or the default action is applied to the file. Currently, it must be count(_classifications) > 0, which indicates that the file contains the search pattern.

The classifier parameters accept the following values:

  • SEARCH_PATTERN_SET – Any valid regular expression, e.g. /[0-9]{9}/
  • SEARCH_PATTERN_NAME – A regular expression defined in Manage Pattern Group – Available Patterns
  • SEARCH_PATTERN_GROUP – A regular expression defined in Manage Pattern Group – Pattern Group

Match Action

Actions taken when a file passes the precondition and the classifier result, based on the parameters, meets the condition. For example, assign metadata set PII and set the metadata attribute Confidential to true.

Default Action

Actions taken when a file passes the precondition but the classifier result, based on the parameters, does not meet the condition. For example, assign metadata set PII and set the metadata attribute Confidential to false.
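Taken together, the sections of a rule definition form a short pipeline: precondition, classifier, condition, then either the match action or the default action. The Python sketch below illustrates that flow under stated assumptions; the rule layout, the evaluate_rule function, and the key names are hypothetical and do not reflect FileCloud's actual rule format.

```python
import re

def evaluate_rule(rule, file_info, content, metadata):
    """Run one rule against one file and return which branch was taken."""
    if not rule["precondition"](file_info):
        return "skipped"  # the file never reaches the classifier
    classifications = rule["classifier"](content)
    if rule["condition"](classifications):
        metadata.update(rule["match_action"])
        return "match"
    metadata.update(rule["default_action"])
    return "default"

# Hypothetical rule: flag .txt files that contain a nine-digit number.
rule = {
    "precondition": lambda f: f["ext"] == "txt",
    "classifier": lambda text: re.findall(r"[0-9]{9}", text),
    "condition": lambda found: len(found) > 0,  # count(_classifications) > 0
    "match_action": {"PII.Confidential": True},
    "default_action": {"PII.Confidential": False},
}

matched_meta, default_meta, skipped_meta = {}, {}, {}
matched = evaluate_rule(rule, {"ext": "txt"}, "id 123456789", matched_meta)
defaulted = evaluate_rule(rule, {"ext": "txt"}, "nothing here", default_meta)
skipped = evaluate_rule(rule, {"ext": "pdf"}, "id 123456789", skipped_meta)
print(matched, matched_meta)
print(defaulted, default_meta)
print(skipped, skipped_meta)
```

Note that a file failing the precondition gets neither action, whereas a file that passes the precondition always receives one of the two.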

CCE Limitations

Before jumping into our real-life example, there are some important limitations to note regarding CCE within FileCloud.

  • CCE will only function properly if SOLR has been configured and storage has been indexed. Additionally, administrators must have created at least one set of metadata for the classification process to operate.
  • Because the order of rule execution is not guaranteed, rules that write to the same metadata attribute often produce unexpected classifications; each rule should therefore use a unique metadata attribute.
  • To prevent overwriting metadata intentionally added by users, CCE does not overwrite metadata it didn’t add itself. Users must remove manually added metadata set values to allow CCE to add its own metadata.
  • CCE updates classification if a file no longer meets the condition of a rule after it is updated and re-uploaded. For example, if a file with a credit card number that is classified as PII is re-uploaded without the credit card number, the PII classification is removed.
  • Empty files cannot be indexed and classified.
  • The default maximum size for indexed files is 10MB; therefore, by default, files larger than 10MB are not classifiable by CCE and are not available for content search.
  • As of FileCloud Version 20.3, if you have OCR enabled, CCE scans image and PDF files for matching patterns.
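The overwrite limitation in the list above can be pictured as a simple ownership check on each metadata value. The Python sketch below is only an illustration under assumptions: the "source" tag, the apply_cce_value function, and the sample values are hypothetical, and the source does not describe how FileCloud tracks who set a value.

```python
def apply_cce_value(metadata, attribute, value):
    """Write a CCE value unless a user set the attribute manually.

    Returns True if the value was written, False if it was left untouched.
    """
    current = metadata.get(attribute)
    if current is not None and current["source"] == "user":
        return False  # never overwrite metadata that CCE did not add itself
    metadata[attribute] = {"value": value, "source": "cce"}
    return True

# Hypothetical state: a user has already set PII.Level by hand.
metadata = {"PII.Level": {"value": "Public", "source": "user"}}

wrote_level = apply_cce_value(metadata, "PII.Level", "Confidential")
wrote_high = apply_cce_value(metadata, "PII.High", True)
rewrote_high = apply_cce_value(metadata, "PII.High", False)
print(wrote_level, wrote_high, rewrote_high)
print(metadata)
```

In this model, the user-set PII.Level stays untouched until the user removes it, while values that CCE wrote itself can be updated on later classification runs.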

Real-Life Example: Social Security Numbers

Let’s demonstrate the capabilities of the CCE system with a real-life sample: classifying documents based on the occurrence of U.S. Social Security numbers (SSNs). We’ll limit ourselves to files that are smaller than 1MB, but to show the full potential of Smart Classification, let’s classify documents in the following way:

  • If the file has at least two unique SSNs, set the value of the PII.High attribute to “true”.
  • If the file has at least one unique SSN, set the value of the PII.Level attribute to “Confidential”.

The following rules describe the above scenarios:

Rule 1 – set the PII.High attribute to “true” when the document contains more than one unique SSN. Otherwise, it is set to false, which is the default value of the metadata attribute.

Rule 2 – set PII.Level to “Confidential” if there is at least one SSN in the document.

For the purpose of the demo, three files are being used:

  • No SSNs.txt – contains no SSN
  • Single SSN – multiple occurrences – contains a single SSN repeated multiple times
  • Multiple SSNs – contains multiple unique SSNs

The following classifications are observed:

No SSN – the Level field is set to “-”, based on the default action in the second rule, and PII.High is left intact.

Single SSN – the Level field is set to “Confidential” based on Rule 2’s match action. Since there is only one unique SSN defined, PII.High is not set.

Multiple SSNs – the Level field is set to “Confidential” and PII.High is set to “true”, since at least two unique SSNs are present in that file.
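The outcomes above can be reproduced with a short Python sketch of the two rules. This is an approximation, not the actual rule definitions: the dashed SSN pattern, the 1,000,000-byte reading of 1MB, the classify_ssn function, and the sample file contents are all assumptions made for illustration.

```python
import re

# Assumed SSN format (AAA-GG-SSSS); the real pattern set may differ.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
MAX_SIZE_BYTES = 1_000_000  # assumed interpretation of the 1MB limit

def classify_ssn(content, size_bytes):
    """Apply both rules and return the resulting PII metadata."""
    metadata = {"PII.High": False, "PII.Level": None}  # False is the attribute default
    if size_bytes >= MAX_SIZE_BYTES:
        return metadata  # precondition not met, the file is not classified
    unique_ssns = set(SSN_PATTERN.findall(content))
    # Rule 1: at least two unique SSNs mark the file as high risk.
    if len(unique_ssns) >= 2:
        metadata["PII.High"] = True
    # Rule 2: at least one SSN sets the level; the default action sets "-".
    metadata["PII.Level"] = "Confidential" if unique_ssns else "-"
    return metadata

# Hypothetical contents for the three demo files.
no_ssn = classify_ssn("Quarterly report, nothing sensitive.", 36)
single_ssn = classify_ssn("SSN 123-45-6789 appears twice: 123-45-6789", 42)
multiple_ssns = classify_ssn("123-45-6789 and 987-65-4321", 27)
print(no_ssn)
print(single_ssn)
print(multiple_ssns)
```

Counting unique values rather than raw matches is what separates the Single SSN file, where one number is repeated, from the Multiple SSNs file.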

As shown above, the classification gave the expected results for all three files. Now, let’s combine it with the following DLP rules:

  1. Block any downloads when PII.High is set to true
  2. Allow downloads but raise a violation when the PII.Level is set to “Confidential”.
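As a rough illustration, the two DLP rules can be modeled as a decision function over the metadata that CCE assigned. The Python sketch below assumes that the blocking rule is checked first and uses hypothetical names (check_download and the decision strings); it is not FileCloud's DLP rule syntax.

```python
def check_download(metadata):
    """Return the download decision for a file based on its PII metadata."""
    # Rule 1: block any download when PII.High is true (assumed to be checked first).
    if metadata.get("PII.High") is True:
        return "block"
    # Rule 2: allow the download but raise a violation for Confidential files.
    if metadata.get("PII.Level") == "Confidential":
        return "allow_with_violation"
    return "allow"

# Metadata as assigned to the three demo files in the classification step.
decisions = {
    "No SSN": check_download({"PII.High": False, "PII.Level": "-"}),
    "Single SSN": check_download({"PII.High": False, "PII.Level": "Confidential"}),
    "Multiple SSNs": check_download({"PII.High": True, "PII.Level": "Confidential"}),
}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

The DLP side never inspects file content here; it relies entirely on the metadata written by the classification rules.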

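The following behaviors are observed:

For the file that contains multiple SSNs, PII.High is set to “true”, so the first DLP rule blocks the download.

For the Single SSN file, the download is allowed and the violation is logged.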

The above example demonstrates the power behind the CCE-Metadata-DLP combination. This functionality can operate in a “fire-and-forget” mode: once metadata sets and attributes are defined, the CCE and DLP rules work automatically in the background, ensuring that the organization’s sensitive data is protected.

Article written by Tomasz Formański