
How to remove duplicate records in Azure Data Factory

In this video we show how to load spreadsheets from Azure Data Lake Storage, removing duplicate records and configuring a trigger that runs whenever a new file is created.

We'll cover the following techniques:

1. Enable Azure Data Factory Studio preview capabilities (PREVIEW UPDATE):
  • By enabling preview capabilities, you can access and test the latest features and updates in Azure Data Factory Studio before they become generally available.
2. Create folder and upload files to Azure Data Lake Storage (AZURE STORAGE EXPLORER):
  • Using Azure Storage Explorer, you can create folders and upload files to Azure Data Lake Storage, providing efficient and scalable storage for your data (a programmatic equivalent is sketched below).
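
As an alternative to Azure Storage Explorer, a minimal Python sketch using the azure-storage-file-datalake package can create the folder and upload a spreadsheet. The account URL, container, folder, and file names below are illustrative assumptions, not taken from the video.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical names: replace with your own storage account, container, and file.
ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
CONTAINER = "raw"
FOLDER = "sales"

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
filesystem = service.get_file_system_client(CONTAINER)

# Create the folder and upload the spreadsheet into it.
directory = filesystem.create_directory(FOLDER)
with open("sales_2023.xlsx", "rb") as data:
    directory.get_file_client("sales_2023.xlsx").upload_data(data, overwrite=True)
```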

3. Create Dataflow (DATAFLOW):
  • Dataflows are visual structures in Azure Data Factory that allow you to create, modify, and manage Extract, Transform, and Load (ETL) processes in an intuitive way.
4. Add the source (SOURCE, EXCEL FORMAT, SHEET INDEX):
  • In the dataflow, you can add a source, such as an Excel file, specifying the format and the index of the desired sheet (see the sketch below).
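
The source itself is configured in the Studio UI; as an analogy only (not the ADF implementation), reading a specific sheet by index can be illustrated with pandas. The file name is an assumption, and `read_excel` requires the openpyxl package.

```python
import pandas as pd

# Read the second sheet (index 1) of the workbook, analogous to setting
# the sheet index on an Excel-format source in the data flow.
df = pd.read_excel("sales_2023.xlsx", sheet_name=1)
print(df.head())
```
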
5. Dynamically load files from a folder (SOURCE OPTIONS, WILDCARD PATHS):
  • You can use source options that support wildcard paths to dynamically load every matching file in a folder (see the sketch below).
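
Again as an analogy to the wildcard-path source option, a short Python sketch can load every Excel file matching a pattern in a local folder; the folder and pattern are assumptions.

```python
import glob
import pandas as pd

# Load every workbook matching the wildcard, like "sales/*.xlsx" in source options.
frames = [pd.read_excel(path, sheet_name=1) for path in glob.glob("sales/*.xlsx")]
df = pd.concat(frames, ignore_index=True)
```
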
6. Define a column to store the file associated with each record (COLUMN, STORE FILE NAME):
  • During the transformation process, you can define a column to store the name of the file associated with each record, making it possible to trace the origin of the data (see the snippet below).
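
Extending the previous sketch, the "store file name" behavior can be mimicked by adding a column with the source path while loading; the column name `source_file` is an assumption.

```python
import glob
import pandas as pd

# Tag each row with the file it came from, like the "column to store file name"
# setting on the data flow source.
frames = [
    pd.read_excel(path, sheet_name=1).assign(source_file=path)
    for path in glob.glob("sales/*.xlsx")
]
df = pd.concat(frames, ignore_index=True)
```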

7. Add Aggregate Step (AGGREGATE, GROUP BY, COUNT):
  • By adding an aggregate step, you can group records by specific columns and summarize them, for example counting the rows in each group, which is the basis for removing duplicates (see the sketch below).
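
The aggregate transformation used for deduplication can be illustrated in pandas: group by the key columns, count the rows per group, and keep one row per group. The key column names are assumptions, and `df` comes from the previous sketch.

```python
# Assume df has key columns "customer_id" and "order_date" (illustrative names).
keys = ["customer_id", "order_date"]

# Count how many rows fall in each group (the COUNT in the aggregate step)...
counts = df.groupby(keys).size().reset_index(name="row_count")

# ...and keep a single row per group, which removes the duplicates.
deduplicated = df.drop_duplicates(subset=keys, keep="first")
```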

8. Store the records in a DATASET (PARQUET format) file:
  • By setting the target file format to Parquet, you optimize storage efficiency and read performance (see the snippet below).
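
Writing the deduplicated result to Parquet can be sketched with pandas (requires pyarrow or fastparquet); the output file name is an assumption, and `deduplicated` comes from the previous sketch.

```python
# Write the deduplicated records to a single Parquet file.
deduplicated.to_parquet("sales_deduplicated.parquet", index=False)
```
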
9. Set file access permissions (UMASK, OWNER, GROUPS, OTHERS):
  • You can set file access permissions to help secure the data by specifying a permission mask, the owner, and the group (see the sketch below).
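
In the video this is configured on the data flow sink; the same POSIX-style permissions can also be set on an ADLS Gen2 file with the azure-storage-file-datalake SDK. The account, container, path, owner, group, and permission string below are illustrative assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_client = service.get_file_system_client("curated") \
                     .get_file_client("sales/sales_deduplicated.parquet")

# rwx for the owner, r-x for the group, nothing for others (illustrative mask).
file_client.set_access_control(owner="<owner-object-id>",
                               group="<group-object-id>",
                               permissions="rwxr-x---")
```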

10. Remove column from mapping (AUTO MAPPING, INPUT COLUMNS):
  • If necessary, you can remove columns from the mapping during the data transformation, ensuring that only the desired columns are included in the final result (see the snippet below).
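
Removing a column from the mapping is analogous to dropping it from the DataFrame before writing; the column name is the assumed helper column from the earlier sketch.

```python
# Exclude the helper column from the final output, like unchecking it
# in the sink's input-column mapping.
final = deduplicated.drop(columns=["source_file"])
```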

11. Create pipeline to execute Data Flow (PIPELINE, DATA FLOW ACTIVITY):
  • Pipelines are used to orchestrate and schedule activities, including the execution of dataflows. The Data Flow activity within a pipeline initiates the execution of the ETL process (see the sketch below).
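
Pipelines are normally authored in the Studio UI; as a rough sketch, the azure-mgmt-datafactory SDK can also create a pipeline containing an Execute Data Flow activity. Subscription, resource group, factory, and object names are placeholders, and this is an assumption about how you might script it rather than the method shown in the video.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ExecuteDataFlowActivity, DataFlowReference,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# One activity that runs the data flow built in the previous steps.
activity = ExecuteDataFlowActivity(
    name="RunDeduplication",
    data_flow=DataFlowReference(reference_name="df_remove_duplicates",
                                type="DataFlowReference"),
)
pipeline = PipelineResource(activities=[activity])

client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "pl_remove_duplicates", pipeline
)
```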

12. Trigger the pipeline on storage events (TRIGGER, STORAGE EVENTS, BLOB CREATED):
  • You can add a trigger that responds to storage events, such as a blob being created. This lets the pipeline run automatically when new data is added (see the sketch below).
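
A storage-event trigger can likewise be sketched with the same SDK, reusing the `client` from the previous snippet; the storage account resource ID, container path, and names are placeholder assumptions. Note that a newly created trigger still has to be started before it fires.

```python
from azure.mgmt.datafactory.models import (
    TriggerResource, BlobEventsTrigger, TriggerPipelineReference, PipelineReference,
)

# Fire when a new blob lands under the monitored folder.
trigger = BlobEventsTrigger(
    events=["Microsoft.Storage.BlobCreated"],
    scope=("/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
           "/providers/Microsoft.Storage/storageAccounts/<storage-account>"),
    blob_path_begins_with="/raw/blobs/sales/",
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            reference_name="pl_remove_duplicates", type="PipelineReference"
        )
    )],
)

client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "tr_blob_created",
    TriggerResource(properties=trigger),
)
```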

13. Add File & View Pipeline Running Automatically (TRIGGER RUNS, PIPELINE RUNS):
  • When you add a file that meets the trigger criteria, the pipeline runs automatically. You can inspect the execution in the trigger run and pipeline run logs.

This content includes:
  • Content: Video
  • Language: Portuguese
  • Duration: 10m 42s
  • Subtitles: Yes
  • Reading time: 2 min 9 s

Fabio Santos

Data Scientist and Consultant for Digital and Analytics Solutions



YouTube channel

@fabioms

Subscribe now