
How to remove duplicate records in Azure Data Factory

In this video we show how to load spreadsheets from Azure Data Lake Storage, removing duplicate records and configuring a trigger that runs whenever a new file is created.

We'll cover the following techniques:

1. Enable Azure Data Factory Studio preview capabilities (PREVIEW UPDATE):
  • By enabling preview capabilities, you can access and test the latest features and updates in Azure Data Factory Studio before they become generally available.
2. Create folder and upload files to Azure Data Lake Storage (AZURE STORAGE EXPLORER):
  • Using Azure Storage Explorer, you can create folders and upload files to Azure Data Lake Storage, providing efficient and scalable storage for your data (a programmatic equivalent is sketched below).
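
As an alternative to Azure Storage Explorer, a minimal Python sketch using the azure-storage-file-datalake package can create the folder and upload a spreadsheet. The account URL, container, folder, and file names below are illustrative assumptions, not taken from the video.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical names: replace with your own storage account, container, and file.
ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
CONTAINER = "raw"
FOLDER = "sales"

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
filesystem = service.get_file_system_client(CONTAINER)

# Create the folder and upload the spreadsheet into it.
directory = filesystem.create_directory(FOLDER)
with open("sales_2023.xlsx", "rb") as data:
    directory.get_file_client("sales_2023.xlsx").upload_data(data, overwrite=True)
```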

3. Create Dataflow (DATAFLOW):
  • Dataflows are visual structures in Azure Data Factory that allow you to create, modify, and manage Extract, Transform, and Load (ETL) processes in an intuitive way.
4. Add the source (SOURCE, EXCEL FORMAT, SHEET INDEX):
  • In the dataflow, you can add a source, such as an Excel file, specifying the format and the index of the desired sheet (see the sketch below).
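
The source itself is configured in the Studio UI; as an analogy only (not the ADF implementation), reading a specific sheet by index can be illustrated with pandas. The file name is an assumption, and `read_excel` requires the openpyxl package.

```python
import pandas as pd

# Read the second sheet (index 1) of the workbook, analogous to setting
# the sheet index on an Excel-format source in the data flow.
df = pd.read_excel("sales_2023.xlsx", sheet_name=1)
print(df.head())
```
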
5. Dynamically load files from a folder (SOURCE OPTIONS, WILDCARD PATHS):
  • You can use source options that support wildcard paths to dynamically load every matching file in a folder (see the sketch below).
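
Again as an analogy to the wildcard-path source option, a short Python sketch can load every Excel file matching a pattern in a local folder; the folder and pattern are assumptions.

```python
import glob
import pandas as pd

# Load every workbook matching the wildcard, like "sales/*.xlsx" in source options.
frames = [pd.read_excel(path, sheet_name=1) for path in glob.glob("sales/*.xlsx")]
df = pd.concat(frames, ignore_index=True)
```
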
6. Define a column to store the file associated with each record (COLUMN, STORE FILE NAME):
  • During the transformation process, you can define a column to store the name of the file associated with each record, making it possible to trace the origin of the data (see the snippet below).
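
Extending the previous sketch, the "store file name" behavior can be mimicked by adding a column with the source path while loading; the column name `source_file` is an assumption.

```python
import glob
import pandas as pd

# Tag each row with the file it came from, like the "column to store file name"
# setting on the data flow source.
frames = [
    pd.read_excel(path, sheet_name=1).assign(source_file=path)
    for path in glob.glob("sales/*.xlsx")
]
df = pd.concat(frames, ignore_index=True)
```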

7. Add Aggregate Step (AGGREGATE, GROUP BY, COUNT):
  • By adding an aggregate step, you can group records by specific columns and summarize them, for example counting the rows in each group, which is the basis for removing duplicates (see the sketch below).
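
The aggregate transformation used for deduplication can be illustrated in pandas: group by the key columns, count the rows per group, and keep one row per group. The key column names are assumptions, and `df` comes from the previous sketch.

```python
# Assume df has key columns "customer_id" and "order_date" (illustrative names).
keys = ["customer_id", "order_date"]

# Count how many rows fall in each group (the COUNT in the aggregate step)...
counts = df.groupby(keys).size().reset_index(name="row_count")

# ...and keep a single row per group, which removes the duplicates.
deduplicated = df.drop_duplicates(subset=keys, keep="first")
```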

8. Store the records in a DATASET (PARQUET format) file:
  • By setting the target file format to Parquet, you optimize storage efficiency and read performance (see the snippet below).
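
Writing the deduplicated result to Parquet can be sketched with pandas (requires pyarrow or fastparquet); the output file name is an assumption, and `deduplicated` comes from the previous sketch.

```python
# Write the deduplicated records to a single Parquet file.
deduplicated.to_parquet("sales_deduplicated.parquet", index=False)
```
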
9. Set file access permissions (UMASK, OWNER, GROUPS, OTHERS):
  • You can set file access permissions to help secure the data by specifying a permission mask, the owner, and the group (see the sketch below).
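
In the video this is configured on the data flow sink; the same POSIX-style permissions can also be set on an ADLS Gen2 file with the azure-storage-file-datalake SDK. The account, container, path, owner, group, and permission string below are illustrative assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_client = service.get_file_system_client("curated") \
                     .get_file_client("sales/sales_deduplicated.parquet")

# rwx for the owner, r-x for the group, nothing for others (illustrative mask).
file_client.set_access_control(owner="<owner-object-id>",
                               group="<group-object-id>",
                               permissions="rwxr-x---")
```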

10. Remove column from mapping (AUTO MAPPING, INPUT COLUMNS):
  • If necessary, you can remove columns from the mapping during the data transformation, ensuring that only the desired columns are included in the final result (see the snippet below).
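
Removing a column from the mapping is analogous to dropping it from the DataFrame before writing; the column name is the assumed helper column from the earlier sketch.

```python
# Exclude the helper column from the final output, like unchecking it
# in the sink's input-column mapping.
final = deduplicated.drop(columns=["source_file"])
```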

11. Create pipeline to execute Data Flow (PIPELINE, DATA FLOW ACTIVITY):
  • Pipelines are used to orchestrate and schedule activities, including the execution of dataflows. The Data Flow activity within a pipeline initiates the execution of the ETL process (see the sketch below).
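
Pipelines are normally authored in the Studio UI; as a rough sketch, the azure-mgmt-datafactory SDK can also create a pipeline containing an Execute Data Flow activity. Subscription, resource group, factory, and object names are placeholders, and this is an assumption about how you might script it rather than the method shown in the video.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteDataFlowActivity, DataFlowReference,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# One activity that runs the data flow built in the previous steps.
activity = ExecuteDataFlowActivity(
    name="RunDeduplication",
    data_flow=DataFlowReference(reference_name="df_remove_duplicates",
                                type="DataFlowReference"),
)
pipeline = PipelineResource(activities=[activity])

client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "pl_remove_duplicates", pipeline
)
```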

12. Trigger the pipeline on storage events (TRIGGER, STORAGE EVENTS, BLOB CREATED):
  • You can add a trigger that responds to storage events, such as a blob being created. This lets the pipeline run automatically when new data is added (see the sketch below).
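
A storage-event trigger can likewise be sketched with the same SDK, reusing the `client` from the previous snippet; the storage account resource ID, container path, and names are placeholder assumptions. Note that a newly created trigger still has to be started before it fires.

```python
from azure.mgmt.datafactory.models import (
    TriggerResource, BlobEventsTrigger, TriggerPipelineReference, PipelineReference,
)

# Fire when a new blob lands under the monitored folder.
trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    scope=("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
           "/providers/Microsoft.Storage/storageAccounts/<storage-account>"),
    blob_path_begins_with="/raw/blobs/sales/",
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            reference_name="pl_remove_duplicates", type="PipelineReference"
        )
    )],
)

client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "tr_blob_created",
    TriggerResource(properties=trigger),
)
```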

13. Add File & View Pipeline Running Automatically (TRIGGER RUNS, PIPELINE RUNS):
  • When you add a file that meets the trigger criteria, the pipeline runs automatically. You can inspect the execution in the trigger run and pipeline run logs.

This content includes:
  • Content: Video
  • Language: Portuguese
  • Duration: 10m 42s
  • Subtitles: Yes
  • Reading time: 2 min 9 s

Fabio Santos

Data Scientist and Consultant for Digital and Analytics Solutions



YouTube channel

@fabioms

Subscribe now