
How to Do Web Scraping in Azure Synapse Analytics

In this video we show how to use the Web Scraping technique to extract data from web pages using Python in Azure Synapse Analytics.

We'll get to know the following techniques:

1. Use a notebook with an Apache Spark instance attached and the Python programming language (ATTACH TO, LANGUAGE, PYSPARK):

  • In Azure Synapse Analytics, you can create notebooks that run on an Apache Spark instance with the Python language. The Attach to option connects the notebook to the Spark pool, and the language is set to Python with the Language selector or the %%pyspark cell magic.

    %%pyspark
    # Run this cell with the PySpark interpreter
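
  • A quick way to confirm the attachment works; the spark session object is predefined in Synapse notebooks, so this sketch simply prints its version:

    %%pyspark
    # Confirm the attached Spark pool responds
    print(spark.version)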
    

2. View the resources associated with the Apache Spark instance (CONFIGURE SESSION):
  • To view and configure the resources associated with the Spark instance, use the notebook's Configure session option or set Spark properties directly from code; these settings apply only to the current Spark session.

    # Configure a Spark session property
    spark.conf.set("spark.some.config.option", "config-value")
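
  • Session resources can also be requested declaratively with the %%configure magic at the start of a cell; the values below are illustrative:

    %%configure
    {
        "driverMemory": "4g",
        "driverCores": 2,
        "executorMemory": "4g",
        "executorCores": 2,
        "numExecutors": 2
    }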
    

3. Identify the installed packages of the Apache Spark instance (PKG_RESOURCES, WORKING_SET, PRINT):

  • You can list the Python packages installed on the Spark instance using the pkg_resources module and its working_set collection.

    # List the installed packages
    import pkg_resources
    for package in pkg_resources.working_set:
        print(package)
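
  • To check a single package, pkg_resources can also report its version; beautifulsoup4 here is just an example name:

    import pkg_resources
    # Raises DistributionNotFound if the package is missing
    print(pkg_resources.get_distribution("beautifulsoup4").version)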
    

4. Import Python packages (PANDAS, REQUESTS, BEAUTIFULSOUP):

  • Import Python libraries needed for data manipulation, web requests, and scraping.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    

5. Get web page content (REQUESTS, HTML CODE, ELEMENTS, TABLES, ROWS, COLUMNS):

  • Use libraries such as requests and BeautifulSoup to retrieve the content of a web page and extract information.

    url = 'https://example.com'
    # Time out after 30 seconds and fail fast on HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_code = response.text
    

6. Convert HTML Code Elements to Array List (BEAUTIFULSOUP, HTML5LIB):

  • Use BeautifulSoup with the html5lib parser to parse the HTML code and extract elements such as tables.

    soup = BeautifulSoup(html_code, 'html5lib')
    tables = soup.find_all('table')
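
  • html5lib is a separate package that must be installed on the Spark pool; if it is not available, Python's built-in parser is a drop-in fallback:

    # Fallback parser that needs no extra package
    soup = BeautifulSoup(html_code, 'html.parser')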
    

7. Interact with and identify elements in the Array list (FIND_ALL, TABLE, FOR, ENUMERATE):

  • Iterate over elements in a list and extract specific information.

    for i, table in enumerate(tables):
        print(f"Table {i + 1}:")
        rows = table.find_all('tr')
        for row in rows:
            columns = row.find_all('td')
            for col in columns:
                print(col.text)
    

8. Create Dataframe (PANDAS, COLUMNS):

  • Use the Pandas library to create a DataFrame with the extracted data.

    data = {'Column1': [value1, value2, ...], 'Column2': [value1, value2, ...]}
    df = pd.DataFrame(data, columns=['Column1', 'Column2'])
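
  • A fuller sketch that ties the previous steps together, assuming the first scraped table has a header row of th cells followed by td data rows (all names here are illustrative):

    first_table = tables[0]
    headers = [th.text.strip() for th in first_table.find_all('th')]
    rows = []
    for tr in first_table.find_all('tr'):
        cells = [td.text.strip() for td in tr.find_all('td')]
        if cells:
            rows.append(cells)
    df = pd.DataFrame(rows, columns=headers)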
    

9. Add Records to the Dataframe (APPEND):

  • Add new records to the DataFrame as needed.

    new_data = {'Column1': [new_value1], 'Column2': [new_value2]}
    # DataFrame.append was removed in pandas 2.0; use concat instead
    df = pd.concat([df, pd.DataFrame(new_data)], ignore_index=True)
    

10. Visualize existing data in the Dataframe:

  • Display the DataFrame data.

    print(df)
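
  • Synapse notebooks also provide a display() function that renders a DataFrame as an interactive, chartable table:

    display(df)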

11. Save the Dataframe logs to a file in Parquet format (PANDAS, TO_PARQUET, AZURE DATA LAKE STORAGE):

  • Save the DataFrame to a Parquet file, which can then be uploaded to Azure Data Lake Storage.

    df.to_parquet('output.parquet')
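
  • One way to write directly to Azure Data Lake Storage from Synapse is converting to a Spark DataFrame and writing to an abfss path; the container, account, and path below are placeholders:

    spark_df = spark.createDataFrame(df)
    # Replace container, account, and path with your own values
    spark_df.write.mode('overwrite').parquet(
        'abfss://container@account.dfs.core.windows.net/path/scraped-data')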

12. Query the logs in SQL script:

  • Run SQL queries in Azure Synapse Analytics to analyze or manipulate the data.

    # SQL query from PySpark
    spark.sql("SELECT * FROM my_table").show()
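
  • For my_table to be queryable, the DataFrame can first be registered as a temporary view; a sketch assuming two separate notebook cells, with spark_df carried over from the previous step:

    # Cell 1: register the Spark DataFrame as a temporary view
    spark_df.createOrReplaceTempView('my_table')

    # Cell 2: query it with the %%sql cell magic
    %%sql
    SELECT * FROM my_table LIMIT 10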

This content includes:

  • Content: Video
  • Language: Portuguese
  • Duration: 10m 39s
  • Subtitles: Yes
  • Reading time: 2 min 24 s

Fabio Santos

Data Scientist and Consultant for Digital and Analytics Solutions



Youtube Channel

@fabioms
