Sensitive data model
We recognize a number of data types that can potentially identify a person, such as a name or an email address. We also take into account the context in which the data was used, for example whether the first and last names appeared together with a street address, or whether a name was actually a company name rather than a person's name. Based on that, we assess whether the usage posed a risk of identifying a person, and how high that risk was.
We developed our own machine learning model for data detection and classification.
You can think of the Soveren model as a two-stage classifier. First, for each observed field, the classifier determines the data type, for example email. Then, depending on the context, it decides whether the value is actually sensitive, that is, whether a person's identity could be revealed if the observed value were disclosed.
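The two stages can be illustrated with a minimal Python sketch. It is not the Soveren model, which is a machine learning classifier: the hand-written rules, the `Detection` class, the function names and the `account_type` field are all hypothetical stand-ins, chosen only to show how type detection and the context check fit together.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Toy pattern standing in for a learned email detector (assumption).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class Detection:
    data_type: str   # result of stage 1
    sensitive: bool  # result of stage 2


def detect_type(key: str, value: str) -> Optional[str]:
    """Stage 1: guess the data type of a single field."""
    if EMAIL_RE.match(value):
        return "Email"
    if key.lower() in {"name", "full_name"}:
        return "Person"
    return None


def is_sensitive(data_type: str, context: dict) -> bool:
    """Stage 2: use the surrounding fields to decide whether the value is sensitive."""
    # Hypothetical context rule: a name on a company account is a company name.
    if data_type == "Person" and context.get("account_type") == "company":
        return False
    return True


def classify(record: dict) -> dict:
    """Run both stages over every field of a record."""
    results = {}
    for key, value in record.items():
        data_type = detect_type(key, str(value))
        if data_type is not None:
            results[key] = Detection(data_type, is_sensitive(data_type, record))
    return results
```

Calling `classify({"name": "Acme Ltd", "account_type": "company", "email": "jane@example.com"})` in this sketch marks the email as sensitive and the name as not sensitive, because the context suggests a company name.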
On top of that, the model assigns different weights, or sensitivities, to different data types and their combinations. These sensitivities reflect how likely it is that a person's identity can be established when the data is disclosed.
Recognised sensitive data types
Right now Soveren detects the following data types:
| Name | Kind | Sensitivity | Comment |
|---|---|---|---|
| Person | PII data | Medium | A person's name, which can be any combination of the first and last names. |
| Gender | PII data | Low | Gender, or more precisely the sex of a person (male or female). |
| Birth date | PII data | Low | Date of birth of a person. This can be any conceivable representation of a date, in the form of any combination of day / month / year, or even a Unix timestamp. |
| Location | PII data | High | Location where the person may reside, i.e. be physically present, live, or receive mail. This includes coordinates like latitude / longitude and all details of a physical address (country code / city / street / building etc.). |
| Phone | PII data | Medium | Phone number. |
| Email | PII data | Medium | Email address. |
| Username | PII data | Medium | User name. |
| IP address | PII data | Medium | IP address. |
| Passport | PII data | High | Passport data, including the number. |
| Pension number | PII data | High | Pension number. |
| Tax number | PII data | High | Taxpayer identification number. |
| SSN | PII data | High | US Social Security Number. |
| Driver license | PII data | High | Driver license number or code. |
| IBAN | PII data | High | International Bank Account Number. |
| Identity document | PII data | High | Identity document. |
| PII data | PII data | High | Generic personally identifiable information (PII). |
| Card | PCI data | High | Credit or debit card number, checked for validity according to standards. |
| Expiration date | PCI data | High | Expiration date of a credit or debit card. |
| CAV2/CVC2/CVV2/CID | PCI data | High | Security code of a credit or debit card. |
| Cardholder name | PCI data | High | Person's name as it appears on the credit or debit card. |
| Full track data | PCI data | High | Data stored on the magnetic stripe of a credit or debit card. |
| Masked card number | PCI data | Medium | Partially masked number of a credit or debit card (last 4 digits, first 4/last 4 etc.). |
| PCI data | PCI data | High | Generic Payment Card Industry (PCI) information. |
| Security token | Developer secrets | High | Security token. |
| Private key | Developer secrets | High | Private key. |
| MAC address | Developer secrets | Medium | Medium access control (MAC) address. |
| IMEI | Developer secrets | Medium | International Mobile Equipment Identity. |
| Password | Developer secrets | High | Password. |
| Authorization code | Developer secrets | High | Authorization code. |
| User ID | Developer secrets | Medium | Identifier of a user. |
| Developer secrets | Developer secrets | High | Generic developer secrets. |
The list of supported data types is ever-growing. Drop us a line if you think we should support a particular data type that you treat as PII or otherwise consider sensitive.
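The Card entry in the table above is described as checked for validity according to standards. The source does not say which checks are applied; as an assumption, the sketch below shows the Luhn checksum, a widely used validity check for payment card numbers. The function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits of `number` pass the Luhn checksum."""
    # Allow the common grouping characters, then require plain ASCII digits.
    compact = number.replace(" ", "").replace("-", "")
    if not (compact.isascii() and compact.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(compact)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A checksum like this only filters out digit strings that cannot be card numbers. It does not prove that a card exists, so a detector would presumably combine it with other signals such as the field name and context.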
Custom data types
You can add your own custom data types using regular expressions to match the field (key) name and its value.
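As a rough illustration of the idea, and not of the actual Soveren configuration format, a custom data type can be modelled as a pair of regular expressions, one for the key and one for the value. The `CustomDataType` class, the "Loyalty card" type and both patterns below are hypothetical, and the sketch assumes that both the key and the value have to match.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomDataType:
    name: str
    key_pattern: re.Pattern    # matched anywhere in the field (key) name
    value_pattern: re.Pattern  # must match the whole value

    def matches(self, key: str, value: str) -> bool:
        """A field belongs to this type only if both key and value match."""
        key_ok = self.key_pattern.search(key) is not None
        value_ok = self.value_pattern.fullmatch(value) is not None
        return key_ok and value_ok


# Hypothetical custom type: loyalty card numbers such as "LC-12345678".
loyalty_card = CustomDataType(
    name="Loyalty card",
    key_pattern=re.compile(r"loyalty", re.IGNORECASE),
    value_pattern=re.compile(r"LC-\d{8}"),
)
```

Matching on the key as well as the value keeps the rule narrow: a value that merely looks like a loyalty card number in an unrelated field is not reported by this sketch.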
Sensitivity model
We consider both individual data types and their combinations, because the sensitivity of a combined data set can be significantly higher than that of any individual data field. For example, a name on its own does not reveal much about a person's identity. But a name combined with a postal address can reveal the identity with much higher certainty.
There are three levels of sensitivity: Low, Medium and High.
All sensitive data types that we recognize are individually assigned the following levels:
- Low: Birth date, Gender
- Medium: Person, Phone, Email, IP address
- High: Location, Card, Driver license, Passport, Tax number, SSN, Pension number, IBAN
Each sensitivity level corresponds to a numerical weight, so different combinations of data types result in different combined sensitivities. For example, Birth date combined with Gender still results in Low sensitivity. Similarly, Person + Phone + Email is of Medium sensitivity, whereas Person + Phone + Email + Gender is of High sensitivity in terms of the potential to identify a person.
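The actual weights and the way they are combined are not published here. The sketch below is only one possible scheme, assuming an additive score with made-up weights (1, 3 and 10) and thresholds picked so that the three examples above come out as stated. The names `WEIGHTS`, `LEVELS` and `combined_sensitivity` are hypothetical.

```python
# Assumed weights per level; the real values are not given in this document.
WEIGHTS = {"Low": 1, "Medium": 3, "High": 10}

# Individual levels as listed above.
LEVELS = {
    "Birth date": "Low", "Gender": "Low",
    "Person": "Medium", "Phone": "Medium", "Email": "Medium", "IP address": "Medium",
    "Location": "High", "Card": "High", "Driver license": "High",
    "Passport": "High", "Tax number": "High", "SSN": "High",
    "Pension number": "High", "IBAN": "High",
}


def combined_sensitivity(data_types) -> str:
    """Sum the weights of the distinct data types and map the score to a level."""
    score = sum(WEIGHTS[LEVELS[name]] for name in set(data_types))
    if score >= 10:  # assumed threshold for High
        return "High"
    if score >= 3:   # assumed threshold for Medium
        return "Medium"
    return "Low"
```

With these assumed numbers, Birth date + Gender scores 2 (Low), Person + Phone + Email scores 9 (Medium), and adding Gender brings the score to 10 (High), so a Low-sensitivity field can still tip a combination into a higher level.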
