Modifying Terms on Datasets

Why Would You Use Terms on Datasets?

The Business Glossary (Term) feature in DataHub helps you use a shared vocabulary within your organization by providing a framework for defining a standardized set of data concepts and then associating them with the physical assets that exist within your data ecosystem.

For more information about terms, refer to About DataHub Business Glossary.

Goal Of This Guide

This guide will show you how to:

  • Create: create a term named Rate of Return.
  • Read: read the terms attached to a dataset named fct_users_created.
  • Add: add the Rate of Return term to the user_name column of a dataset named fct_users_created.
  • Remove: remove the Rate of Return term from the user_name column of a dataset named fct_users_created.

Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. For detailed information, please refer to DataHub Quickstart Guide.

note

Before modifying terms, make sure the target dataset is already present in your DataHub instance. If you attempt to manipulate entities that do not exist, the operation will fail. This guide uses data from the sample ingestion.
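
Every dataset in this guide is addressed by a URN of the form urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>). The following is a minimal sketch of assembling that string by hand; build_dataset_urn is a hypothetical helper written for this illustration, not a DataHub API.

```python
def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Assemble a dataset URN in the shape used by the examples in this guide.

    Hypothetical helper for illustration only; it performs no validation
    and does not check that the dataset exists in DataHub.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The dataset used throughout this guide.
print(build_dataset_urn("hive", "fct_users_created"))
```

Building the URN in one place helps avoid typos in the nested parentheses, which would otherwise point the mutation at a dataset that does not exist.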

For more information on how to set up for GraphQL, please refer to How To Set Up GraphQL.

Create Terms

The following code creates a term Rate of Return.

mutation createGlossaryTerm {
  createGlossaryTerm(
    input: {
      name: "Rate of Return",
      id: "rateofreturn",
      description: "A rate of return (RoR) is the net gain or loss of an investment over a specified time period."
    }
  )
}
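
If you prefer to send the mutation from a script rather than a GraphQL client, it can be posted as JSON over HTTP. The sketch below uses only the Python standard library. The endpoint URL and the access token are assumptions, shown as placeholders that you would replace with the values for your own deployment; build_request and run_graphql are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions (not stated in this guide): adjust both for your deployment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"
ACCESS_TOKEN = "<my-access-token>"

# The same mutation as shown above.
CREATE_TERM = """
mutation createGlossaryTerm {
  createGlossaryTerm(
    input: {
      name: "Rate of Return",
      id: "rateofreturn",
      description: "A rate of return (RoR) is the net gain or loss of an investment over a specified time period."
    }
  )
}
"""


def build_request(query: str, url: str = GRAPHQL_URL, token: str = ACCESS_TOKEN) -> urllib.request.Request:
    """Wrap a GraphQL document in a JSON POST request."""
    body = json.dumps({"query": query, "variables": {}}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def run_graphql(query: str) -> dict:
    """Send the request and decode the JSON response (needs a reachable server)."""
    with urllib.request.urlopen(build_request(query)) as resp:
        return json.load(resp)


# Example call, left commented out because it requires a running instance:
# print(run_graphql(CREATE_TERM))
```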

If you see the following response, the operation was successful:

{
  "data": {
    "createGlossaryTerm": "urn:li:glossaryTerm:rateofreturn"
  },
  "extensions": {}
}
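
In a script, the success check can be automated instead of read by eye. The following sketch assumes the response has already been decoded into a Python dict and relies on the standard GraphQL convention that failures are reported under a top-level "errors" key. created_term_urn is a hypothetical helper name.

```python
import json


def created_term_urn(response: dict) -> str:
    """Return the URN from a createGlossaryTerm response, or raise on failure."""
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    urn = (response.get("data") or {}).get("createGlossaryTerm")
    if not urn:
        raise RuntimeError("createGlossaryTerm returned no URN")
    return urn


# The successful response shown above.
sample = json.loads(
    '{"data": {"createGlossaryTerm": "urn:li:glossaryTerm:rateofreturn"}, "extensions": {}}'
)
print(created_term_urn(sample))
```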

Expected Outcome of Creating Terms

You can now see that the new term Rate of Return has been created.

(Screenshot: term-created)

You can also verify this operation programmatically by fetching the Rate of Return term with the datahub CLI after running this code.

datahub get --urn "urn:li:glossaryTerm:rateofreturn" --aspect glossaryTermInfo

{
  "glossaryTermInfo": {
    "definition": "A rate of return (RoR) is the net gain or loss of an investment over a specified time period.",
    "name": "Rate of Return",
    "termSource": "INTERNAL"
  }
}
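
The CLI check above can also be driven from a script. This is a rough sketch that assumes the datahub CLI is on your PATH and that its standard output is exactly the JSON document shown above, with no extra log lines. The function names are hypothetical.

```python
import json
import subprocess
from typing import List


def datahub_get_argv(urn: str, aspect: str) -> List[str]:
    """Build the same command line as the datahub get call shown above."""
    return ["datahub", "get", "--urn", urn, "--aspect", aspect]


def term_matches(cli_output: str, expected_name: str) -> bool:
    """Check that the JSON output carries the expected term name."""
    info = json.loads(cli_output).get("glossaryTermInfo", {})
    return info.get("name") == expected_name


def verify_term(urn: str, expected_name: str) -> bool:
    """Run the CLI and compare the returned name (requires a running instance)."""
    result = subprocess.run(
        datahub_get_argv(urn, "glossaryTermInfo"),
        capture_output=True,
        text=True,
        check=True,
    )
    return term_matches(result.stdout, expected_name)


# Example call, left commented out because it needs the CLI and a server:
# verify_term("urn:li:glossaryTerm:rateofreturn", "Rate of Return")
```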

Read Terms

The following code reads the terms attached to a dataset named fct_users_created.

query {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)") {
    glossaryTerms {
      terms {
        term {
          urn
          glossaryTermInfo {
            name
            description
          }
        }
      }
    }
  }
}

If you see the following response, the operation was successful:

{
  "data": {
    "dataset": {
      "glossaryTerms": {
        "terms": [
          {
            "term": {
              "urn": "urn:li:glossaryTerm:CustomerAccount",
              "glossaryTermInfo": {
                "name": "CustomerAccount",
                "description": "account that represents an identified, named collection of balances and cumulative totals used to summarize customer transaction-related activity over a designated period of time"
              }
            }
          }
        ]
      }
    }
  },
  "extensions": {}
}
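
The response nests each term several levels deep. As a sketch of how a script might flatten it, the helper below pulls out (urn, name) pairs from a decoded response of the shape shown above. It assumes that a dataset without terms may return null for glossaryTerms, and treats that case as an empty list. extract_terms is a hypothetical name.

```python
from typing import List, Tuple


def extract_terms(response: dict) -> List[Tuple[str, str]]:
    """Return (urn, name) pairs for every term attached to the dataset."""
    dataset = (response.get("data") or {}).get("dataset") or {}
    glossary = dataset.get("glossaryTerms") or {}
    pairs = []
    for entry in glossary.get("terms") or []:
        term = entry["term"]
        pairs.append((term["urn"], term["glossaryTermInfo"]["name"]))
    return pairs


# A trimmed copy of the response shown above.
sample = {
    "data": {
        "dataset": {
            "glossaryTerms": {
                "terms": [
                    {
                        "term": {
                            "urn": "urn:li:glossaryTerm:CustomerAccount",
                            "glossaryTermInfo": {"name": "CustomerAccount"},
                        }
                    }
                ]
            }
        }
    },
    "extensions": {},
}
print(extract_terms(sample))
```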

Add Terms

The following code shows how to add a term to a dataset column. Here, we add the Rate of Return term to the user_name column of a dataset named fct_users_created.

mutation addTerms {
  addTerms(
    input: {
      termUrns: ["urn:li:glossaryTerm:rateofreturn"],
      resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
      subResourceType: DATASET_FIELD,
      subResource: "user_name"
    }
  )
}

Note that you can also add a term to the dataset itself, rather than to a column, by omitting subResourceType and subResource.

mutation addTerms {
  addTerms(
    input: {
      termUrns: ["urn:li:glossaryTerm:rateofreturn"],
      resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
    }
  )
}
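
The two mutations above differ only in whether subResourceType and subResource are present. One way to capture that in a script is a small builder that emits the input object for either case. This is an illustrative sketch; add_terms_input is a hypothetical helper, and how the resulting dict is sent (inlined into the mutation or passed as a GraphQL variable) is left to you.

```python
from typing import Dict, List, Optional


def add_terms_input(
    term_urns: List[str],
    resource_urn: str,
    column: Optional[str] = None,
) -> Dict[str, object]:
    """Build the addTerms input object for a dataset or one of its columns."""
    payload: Dict[str, object] = {
        "termUrns": list(term_urns),
        "resourceUrn": resource_urn,
    }
    if column is not None:
        # Column-level attachment: both fields are set together.
        payload["subResourceType"] = "DATASET_FIELD"
        payload["subResource"] = column
    return payload


DATASET = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
TERM = "urn:li:glossaryTerm:rateofreturn"

print(add_terms_input([TERM], DATASET, column="user_name"))  # column-level
print(add_terms_input([TERM], DATASET))                      # dataset-level
```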

If you see the following response, the operation was successful:

{
  "data": {
    "addTerms": true
  },
  "extensions": {}
}

Expected Outcome of Adding Terms

You can now see that the Rate of Return term has been added to the user_name column.

(Screenshot: term-added)

Remove Terms

The following code removes a term from a dataset column. After running this code, the Rate of Return term will be removed from the user_name column.

mutation removeTerm {
  removeTerm(
    input: {
      termUrn: "urn:li:glossaryTerm:rateofreturn",
      resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
      subResourceType: DATASET_FIELD,
      subResource: "user_name"
    }
  )
}

Note that you can also remove a term from the dataset itself, rather than from a column, by omitting subResourceType and subResource.

mutation removeTerm {
  removeTerm(
    input: {
      termUrn: "urn:li:glossaryTerm:rateofreturn",
      resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
    }
  )
}

Also note that you can remove terms from multiple entities or subresources at once using batchRemoveTerms.

mutation batchRemoveTerms {
  batchRemoveTerms(
    input: {
      termUrns: ["urn:li:glossaryTerm:rateofreturn"],
      resources: [
        { resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)" },
        { resourceUrn: "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)" }
      ]
    }
  )
}
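
When the list of datasets comes from elsewhere in a script, the resources array for batchRemoveTerms can be generated rather than typed out. The sketch below builds the same input object as the mutation above and drops duplicate dataset URNs while keeping their order. batch_remove_terms_input is a hypothetical helper, and only the dataset-level form shown in this guide is covered.

```python
from typing import Dict, Iterable, List


def batch_remove_terms_input(
    term_urns: Iterable[str],
    resource_urns: Iterable[str],
) -> Dict[str, List[object]]:
    """Build the batchRemoveTerms input object for a set of datasets."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique_resources = list(dict.fromkeys(resource_urns))
    return {
        "termUrns": list(term_urns),
        "resources": [{"resourceUrn": urn} for urn in unique_resources],
    }


payload = batch_remove_terms_input(
    ["urn:li:glossaryTerm:rateofreturn"],
    [
        "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)",
    ],
)
print(payload)
```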

Expected Outcome of Removing Terms

You can now see that the Rate of Return term has been removed from the user_name column.

(Screenshot: term-removed)