Classify Documents in FileCloud using Smart Classification

June 11, 2021

This is the last post in the series “Smart Classification, Metadata and Smart DLP – the powerful combo” about data classification and security in FileCloud.

In previous posts, we explained the concepts and capabilities of FileCloud’s Metadata and Data Leak Prevention (DLP) subsystems. Metadata lets users describe the data in their system by attaching an extensive set of attributes to it. DLP, in turn, prevents actions based on the assigned metadata. This workflow is very common in real-life, enterprise-level scenarios; the problem starts when the volume and complexity of data grow to the point where assigning metadata manually is no longer feasible. This is where Smart Classification becomes essential.

Smart Classification

Smart Classification is a powerful and flexible subsystem that allows organizations to define custom rules. Documents in FileCloud can be classified and metadata attributes applied based on that classification. This process is automated, and each newly uploaded file is checked against all available, active rules. The content of that file is evaluated against the provided set of rules and the corresponding metadata is applied based on the result of that evaluation.

Smart Classification is also referred to as the Content Classification Engine (CCE); both names are used interchangeably in this article. CCE automates, streamlines, and strengthens data leak prevention across an organization. Administrators and users can upload files and folders knowing that the uploaded information will be classified automatically according to its content; this helps ensure that sensitive data is immediately covered by the criteria outlined in the DLP plan. CCE rules are also applied retroactively to data that was uploaded before the rules were created, helping organizations protect legacy data.

Conceptually, this is a very simple process, yet a lot happens behind the scenes. Let’s dive deeper into the current possibilities in FileCloud, version 20.3.

When an admin visits the Smart Classification screen, they will see a detailed list of all defined CCE rules.

The CCE rule definition is flexible, allowing admins to shape the use case to their individual needs by specifying classifiers, event triggers, preconditions, conditions and match/no match actions.

Event triggers

Specifies the event upon which the rule is automatically executed. The FILEINDEXED event, which occurs when a document is successfully indexed in SOLR, is the most common across use cases. In FileCloud version 20.3, only ICAP DLP classification uses a different trigger, which fires at file upload rather than at SOLR indexing.

Auto-classification

Specifies whether an uploaded file should be automatically classified or not. If auto-classification is disabled, admins can always run a ‘catch-up’ classification on-demand from the admin panel.

Definition

This is the heart of each CCE rule, specifying all details regarding when classification should happen, which classifier should be used, what condition is treated as a match and which metadata values should be set depending on the rule evaluation result.

Rule Definition

Each definition contains the following sections:

Classifier

Specifies which classifier is used to classify the file content. Currently, the following classifiers are supported:

Result Schema (_classifications): [{term: "term that matched a regex", count: "number of times the term appears in the doc"}, ...] – specifies how many times each term has been matched.

Result Schema (_classifications): ["regex pattern", ...] – specifies which patterns have been matched at least once.
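To make the two result schemas more concrete, here is a minimal Python sketch of how results of each shape could be produced. It is not FileCloud code: the function names term_count_result and matched_patterns_result are hypothetical, and the sketch only assumes that classification is based on regular-expression matching, as the schemas suggest.

```python
import re
from collections import Counter

# Hypothetical helpers, not FileCloud APIs: they only mimic the shape
# of the two _classifications result schemas described above.

def term_count_result(text, pattern):
    """First schema: one entry per matched term with its occurrence count."""
    counts = Counter(m.group(0) for m in re.finditer(pattern, text))
    return [{"term": term, "count": n} for term, n in counts.items()]

def matched_patterns_result(text, patterns):
    """Second schema: the patterns that matched at least once."""
    return [p for p in patterns if re.search(p, text)]

sample = "alpha beta alpha gamma"
term_counts = term_count_result(sample, r"alpha|gamma")
matched = matched_patterns_result(sample, [r"alpha", r"delta"])
print(term_counts)
print(matched)
```

In both cases an empty list means that nothing matched, which is what a condition such as count(_classifications) > 0 would test.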

Pre-condition

Specifies rules that must be met by a file before it is evaluated through the classifier. Some sample pre-conditions:

"precondition":"starts_with(_file.fullPath, '/teamfolders/') && _file.ext=='txt'"

"precondition": "_file.size < 1000"

"precondition": "_file.ext in ['txt', 'pdf', 'csv']"

"precondition": "true",
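The sample pre-conditions above read as simple predicates over file properties. The sketch below restates them in Python to show which files would pass each one. The FileInfo class and the dictionary keys are hypothetical names chosen for illustration; FileCloud evaluates its own expression syntax, not Python.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    """Hypothetical stand-in for the _file object used in pre-conditions."""
    full_path: str
    ext: str
    size: int  # same unit as _file.size; the unit is not stated in this article

# Python equivalents of the four sample pre-conditions, in the same order.
PRECONDITIONS = {
    "team_folder_txt": lambda f: f.full_path.startswith("/teamfolders/") and f.ext == "txt",
    "small_file": lambda f: f.size < 1000,
    "allowed_ext": lambda f: f.ext in ["txt", "pdf", "csv"],
    "always": lambda f: True,
}

def passing(file_info):
    """Return the names of the pre-conditions that the file satisfies."""
    return [name for name, check in PRECONDITIONS.items() if check(file_info)]

report = FileInfo("/teamfolders/hr/report.txt", "txt", 512)
photo = FileInfo("/jdoe/photo.png", "png", 250_000)
print(passing(report))
print(passing(photo))
```

A file that fails the pre-condition is never handed to the classifier, so a narrow pre-condition is also a cheap way to limit the amount of content that has to be scanned.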

Condition

Criteria that decide whether the match action or the default action is applied to a file. Currently, the condition must be count(_classifications) > 0, which indicates that the file contains the search pattern.

Match Action

Actions taken when a file passes the precondition and the classifier result satisfies the condition. For example, assign the metadata set PII and set its attribute Confidential to true.

Default Action

Actions taken when a file passes the precondition but the classifier result does not satisfy the condition. For example, assign the metadata set PII and set its attribute Confidential to false.
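The way these sections fit together can be sketched as a small evaluation function. This is an illustrative model only, assuming the order precondition, classifier, condition, then action; the rule dictionary layout, the evaluate_rule function and the sample pattern are hypothetical and do not reflect FileCloud's internal implementation or rule syntax.

```python
import re

def evaluate_rule(rule, file_info, text):
    """Return the metadata updates for one file, or None if the rule is skipped."""
    if not rule["precondition"](file_info):
        return None  # the file is never passed to the classifier
    classifications = rule["classifier"](text)
    if len(classifications) > 0:  # mirrors count(_classifications) > 0
        return rule["match_action"]
    return rule["default_action"]

# Hypothetical rule: flag text-like files that mention the word "confidential".
pii_rule = {
    "precondition": lambda f: f["ext"] in ["txt", "pdf", "csv"],
    "classifier": lambda text: re.findall(r"\bconfidential\b", text, re.IGNORECASE),
    "match_action": {"PII.Confidential": True},
    "default_action": {"PII.Confidential": False},
}

print(evaluate_rule(pii_rule, {"ext": "txt"}, "Strictly Confidential memo"))
print(evaluate_rule(pii_rule, {"ext": "txt"}, "Lunch menu"))
print(evaluate_rule(pii_rule, {"ext": "png"}, "Strictly Confidential memo"))
```

The three calls cover the three possible outcomes: match action, default action, and a file that is skipped because it fails the precondition.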

CCE Limitations

Before jumping into our real-life example, there are some important limitations to note regarding CCE within FileCloud.

Real-Life Example: Social Security Numbers

Let’s demonstrate the capabilities of the CCE system with a real-life example: classifying documents based on the occurrence of U.S. Social Security Numbers (SSNs). We’ll limit ourselves to files that are smaller than 1 MB, but to show the full potential of Smart Classification, let’s classify documents in the following way:

The following rules describe the above scenarios:

Rule 1 – set the PII.High attribute to true when the document contains more than one SSN. Otherwise it remains false, which is the attribute’s default value.

Rule 2 – set PII.Level to “Confidential” if the document contains at least one SSN.

For the purpose of the demo, three files are used:

The following classifications are observed:

No SSN – the Level field is set to “-”, based on the default action in the second rule, and PII.High is left intact.

Single SSN – the Level field is set to “Confidential” based on Rule 2’s match action. Since there is only one unique SSN defined, PII.High is not set.

Multiple SSNs – the Level field is set to “Confidential” and PII.High is set to “true”, since more than one unique SSN is present in that file.
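The three outcomes can be reproduced with a simplified Python model of the two rules. This is a sketch under stated assumptions, not the actual rule definitions: the SSN pattern is a basic NNN-NN-NNNN match, the 1 MB limit is taken as 1024 * 1024 bytes, and the classify function and the sample SSNs are made up for illustration.

```python
import re

# Assumed, simplified SSN pattern; real rules may use a stricter expression.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ONE_MB = 1024 * 1024  # assumed interpretation of the 1 MB limit

def classify(text, size_bytes):
    """Model Rule 1 (PII.High) and Rule 2 (PII.Level) for a single file."""
    metadata = {"PII.High": False, "PII.Level": None}  # False is the default for PII.High
    if size_bytes >= ONE_MB:
        return metadata  # precondition not met, so neither rule runs
    unique_ssns = set(SSN_PATTERN.findall(text))
    if len(unique_ssns) > 1:  # Rule 1: more than one unique SSN
        metadata["PII.High"] = True
    # Rule 2: match action on at least one SSN, default action otherwise
    metadata["PII.Level"] = "Confidential" if unique_ssns else "-"
    return metadata

print(classify("Quarterly report, no personal data.", 2048))
print(classify("Employee SSN: 123-45-6789", 2048))
print(classify("SSNs on file: 123-45-6789 and 987-65-4321", 2048))
```

Counting unique values rather than raw matches means that a document repeating the same SSN several times is still treated as a single-SSN file.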

As shown above, the classification gave the expected results for all three files. Now, let’s combine it with the following DLP rules:

  1. Block any downloads when PII.High is set to true
  2. Allow downloads but raise a violation when the PII.Level is set to “Confidential”.
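As a rough illustration, the two DLP rules can be modelled as a decision function over a file's metadata. The function name, the returned labels, and the assumption that the blocking rule is checked first are all hypothetical; FileCloud DLP rules are configured in the admin panel rather than written in Python.

```python
def download_decision(metadata):
    """Return the outcome of a download request under the two DLP rules above."""
    if metadata.get("PII.High") is True:
        return "deny"  # rule 1: block any download
    if metadata.get("PII.Level") == "Confidential":
        return "allow_with_violation"  # rule 2: allow, but record a violation
    return "allow"  # no rule applies

multiple_ssns = {"PII.High": True, "PII.Level": "Confidential"}
single_ssn = {"PII.High": False, "PII.Level": "Confidential"}
no_ssn = {"PII.High": False, "PII.Level": "-"}

for name, meta in [("multiple", multiple_ssns), ("single", single_ssn), ("none", no_ssn)]:
    print(name, download_decision(meta))
```

Because a file with multiple SSNs carries both attributes, the order in which the rules are checked matters in this model: the blocking rule has to win over the rule that merely logs a violation.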

The following behaviors are observed for the file that contains multiple SSNs:

 

The violation for the single-SSN file is logged:

The above example demonstrates the power behind the CCE-Metadata-DLP combination. This functionality can operate in “fire-and-forget” mode: once metadata and attributes are set, the defined CCE and DLP rules work automatically in the background, ensuring that the organization’s sensitive data is protected.

Article written by Tomasz Formański

By Katie Gerhardt

Jr. Product Marketing Manager