
#074 How to Do Web Scraping in Azure Synapse Analytics

In this video we show how to use the Web Scraping technique to extract data from web pages using Python in Azure Synapse Analytics.

We'll cover the following techniques:

1. Use a notebook with an attached Apache Spark instance and the Python programming language (ATTACH TO, LANGUAGE, PYSPARK):

  • In Azure Synapse Analytics, you can create notebooks that run on an Apache Spark instance with the Python language. The ATTACH TO option connects the notebook to the Spark instance, and the language is set to Python with LANGUAGE.

    # Magic command to run the cell with PySpark (Python)
    %%pyspark
    

2. View the resources associated with the Apache Spark instance (CONFIGURE SESSION):

  • To view and configure the resources associated with the Spark instance, use the CONFIGURE SESSION command, which lets you adjust settings specific to the Spark session.

    # Configure a Spark session setting
    spark.conf.set("spark.some.config.option", "config-value")
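
  • Synapse also supports a %%configure cell magic to request session resources before the session starts. A minimal sketch (the sizes below are placeholder values, not recommendations):

    %%configure -f
    {
        "driverMemory": "8g",
        "driverCores": 2,
        "executorMemory": "8g",
        "executorCores": 2,
        "numExecutors": 2
    }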
    

3. Identify the installed packages of the Apache Spark instance (PKG_RESOURCES, WORKING_SET, PRINT):

  • You can identify the Python packages installed on the Spark instance using the pkg_resources module and its working_set collection.

    # Identify installed packages
    import pkg_resources
    for package in pkg_resources.working_set:
        print(package)
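
  • If you only need to confirm a single package, pkg_resources.get_distribution works too. A small sketch (the package name here is just an example):

    import pkg_resources
    try:
        print(pkg_resources.get_distribution('beautifulsoup4'))
    except pkg_resources.DistributionNotFound:
        print('beautifulsoup4 is not installed')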
    

4. Import Python packages (PANDAS, REQUESTS, BEAUTIFULSOUP):

  • Import Python libraries needed for data manipulation, web requests, and scraping.

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    

5. Get web page content (REQUESTS, HTML CODE, ELEMENTS, TABLES, ROWS, COLUMNS):

  • Use libraries such as requests and BeautifulSoup to retrieve the content of a web page and extract information.

    url='https://example.com'
    response = requests.get(url)
    html_code = response.text
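
  • In practice it is worth verifying that the request succeeded before parsing. A hedged variant of the call above (the timeout value is arbitrary):

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    html_code = response.text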
    

6. Convert HTML code elements to a list (BEAUTIFULSOUP, HTML5LIB):

  • Use BeautifulSoup to parse HTML code and extract data.

    soup = BeautifulSoup(html_code, 'html5lib')
    tables = soup.find_all('table')
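
  • As an aside (not shown in the video), pandas can parse <table> elements directly from HTML, which is sometimes a shortcut for simple pages:

    import pandas as pd
    dfs = pd.read_html(html_code)  # returns one DataFrame per <table> found
    print(len(dfs))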
    

7. Interact with and identify elements in the list (FIND_ALL, TABLE, FOR, ENUMERATE):

  • Iterate over elements in a list and extract specific information.

    for i, table in enumerate(tables):
        print(f"Table {i + 1}:")
        rows = table.find_all('tr')
        for row in rows:
            columns = row.find_all('td')
            for col in columns:
                print(col.text)
    

8. Create Dataframe (PANDAS, COLUMNS):

  • Use the Pandas library to create a DataFrame with the extracted data.

    data = {'Column1': [value1, value2, ...], 'Column2': [value1, value2, ...]}
    df = pd.DataFrame(data, columns=['Column1', 'Column2'])
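
  • As a sketch of how the scraped rows can feed the DataFrame directly (this assumes the first table has <th> header cells, which may not hold for every page):

    table = tables[0]
    header = [th.text.strip() for th in table.find_all('th')]
    rows = [[td.text.strip() for td in tr.find_all('td')]
            for tr in table.find_all('tr')]
    rows = [r for r in rows if r]  # skip rows without <td> cells (e.g. the header row)
    df = pd.DataFrame(rows, columns=header)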
    

9. Add Records to the Dataframe (APPEND):

  • Add new records to the DataFrame as needed.

    new_data = {'Column1': [new_value1], 'Column2': [new_value2]}
    df = df.append(pd.DataFrame(new_data), ignore_index=True)
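
  • Note that df.append was deprecated in pandas 1.4 and removed in 2.0; on newer runtimes the equivalent is pd.concat:

    new_row = pd.DataFrame({'Column1': [new_value1], 'Column2': [new_value2]})
    df = pd.concat([df, new_row], ignore_index=True)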
    

10. Visualize existing data in the Dataframe:

  • Display the DataFrame contents.

    print(df)
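
  • Synapse notebooks also provide a display() helper that renders a DataFrame as an interactive table; a one-line sketch (availability can depend on the runtime):

    display(df)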

11. Save the Dataframe records to a file in Parquet format (PANDAS, TO_PARQUET, AZURE DATA LAKE STORAGE):

  • Save the DataFrame records to a Parquet file.

    df.to_parquet('output.parquet')
    # then upload the file to Azure Data Lake Storage
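
  • To land the file in Azure Data Lake Storage Gen2, one option is to convert to a Spark DataFrame and write to an abfss path. A sketch in which the storage account, container, and path are placeholders:

    sdf = spark.createDataFrame(df)
    sdf.write.mode('overwrite').parquet(
        'abfss://container@account.dfs.core.windows.net/scraping/output.parquet')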

12. Query the records in a SQL script:

  • Run SQL queries in Azure Synapse Analytics to analyze or manipulate the data.

    # SQL query executed through Spark
    spark.sql("SELECT * FROM my_table").show()
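
  • The query above assumes a table or view named my_table already exists. One way to make the scraped DataFrame queryable is a temporary view, as in this sketch:

    spark.createDataFrame(df).createOrReplaceTempView('my_table')
    spark.sql("SELECT COUNT(*) FROM my_table").show()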

This content contains
  • Content Video
  • Language Portuguese
  • Duration 10m 39s
  • Subtitles Yes

  • Reading time 2 min 24 sec

Fabio Santos

Data Scientist and Consultant for Digital and Analytics Solutions


Youtube Channel

@fabioms
