We'll walk through the following techniques:
1. Use a notebook with an Apache Spark instance attached and the Python programming language (ATTACH TO, LANGUAGE, PYSPARK):
In Azure Synapse Analytics, you can create notebooks that run on an Apache Spark instance with Python as the language. The Attach to option connects the notebook to the Spark instance, and the Language option sets it to Python (PySpark). A cell can also be switched to PySpark explicitly with a cell magic:
%%pyspark
# The magic on the first line runs this cell as PySpark
2. View the resources associated with the Apache Spark instance (CONFIGURE SESSION):
To view and configure the resources associated with the Spark instance, you can use the notebook's Configure session option, or set properties for the current Spark session directly in code.
# Configure a Spark session property
spark.conf.set("spark.some.config.option", "config-value")
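As a sketch, Synapse notebooks also support the %%configure magic to set session resources before the Spark session starts; the memory and core values below are placeholders, not recommendations.
%%configure -f
{
    "driverMemory": "4g",
    "driverCores": 2,
    "executorMemory": "4g",
    "executorCores": 2,
    "numExecutors": 2
}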
3. Identify the installed packages of the Apache Spark instance (PKG_RESOURCES, WORKING_SET, PRINT):
You can list the Python packages installed on the Spark instance with the pkg_resources module and its working_set collection.
# Identify installed packages
import pkg_resources
for package in pkg_resources.working_set:
    print(package)
4. Import Python packages (PANDAS, REQUESTS, BEAUTIFULSOUP):
Import Python libraries needed for data manipulation, web requests, and scraping.
import pandas as pd
import requests
from bs4 import BeautifulSoup
5. Get web page content (REQUESTS, HTML CODE, ELEMENTS, TABLES, ROWS, COLUMNS):
Use the requests library to download the HTML code of the web page whose elements, tables, rows, and columns will be extracted in the next steps; a minimal sketch follows.
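A minimal sketch, assuming a placeholder URL (replace example.com with the page you actually want to scrape):
# Fetch the raw HTML of a page with requests
import requests

url = "https://example.com"  # hypothetical URL for illustration
response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors
html_code = response.text    # HTML string parsed in the next step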
6. Convert HTML Code Elements to Array List (BEAUTIFULSOUP, HTML5LIB):
Use BeautifulSoup with the html5lib parser to turn the HTML code into a navigable tree and collect the table elements into a list.
soup = BeautifulSoup(html_code, 'html5lib')
tables = soup.find_all('table')
7. Interact with and identify elements in the Array list (FIND_ALL, TABLE, FOR, ENUMERATE):
Iterate over elements in a list and extract specific information.
for i, table in enumerate(tables):
    print(f"Table {i + 1}:")
    rows = table.find_all('tr')
    for row in rows:
        columns = row.find_all('td')
        for col in columns:
            print(col.text)
8. Create Dataframe (PANDAS, COLUMNS):
Use the Pandas library to create a DataFrame with the extracted data.
data = {'Column1': [value1, value2, ...], 'Column2': [value1, value2, ...]}
df = pd.DataFrame(data, columns=['Column1', 'Column2'])
9. Add Records to the Dataframe (APPEND):
Add new records to the DataFrame as needed. Note that DataFrame.append was deprecated and removed in pandas 2.0, so pd.concat is the current way to do this.
new_data = {'Column1': [new_value1], 'Column2': [new_value2]}
df = pd.concat([df, pd.DataFrame(new_data)], ignore_index=True)
10. Visualize existing data in the Dataframe:
- Print the DataFrame to inspect the collected data; a richer rendering option is sketched below.
print(df)
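As a sketch, Synapse notebooks also provide a built-in display() function that renders a DataFrame as an interactive table or chart; one common pattern is converting the pandas DataFrame to a Spark DataFrame first (spark is the session the notebook provides).
# Render the data as an interactive table/chart in Synapse
spark_df = spark.createDataFrame(df)  # convert the pandas DataFrame
display(spark_df)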
11. Save the Dataframe logs to a file in Parquet format (PANDAS, TO_PARQUET, AZURE DATA LAKE STORAGE):
- Save the DataFrame logs to a Parquet file and upload the file to Azure Data Lake Storage (see the sketch after this step).
df.to_parquet('output.parquet')
# Upload to Azure Data Lake Storage
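One way to land the file directly in Azure Data Lake Storage Gen2 from a Synapse notebook is to write through Spark to an abfss:// path; the storage account, container, and folder names below are placeholders.
# Write the data to ADLS Gen2 via Spark (names are hypothetical)
spark_df = spark.createDataFrame(df)
output_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/logs/output.parquet"
spark_df.write.mode("overwrite").parquet(output_path)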
12. Query the logs in SQL script:
- Run SQL queries in Azure Synapse Analytics to analyze or manipulate the data; the table referenced below is registered in the sketch that follows.
# Run a SQL query through the Spark session
spark.sql("SELECT * FROM my_table").show()
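For the query above to find any data, the DataFrame first has to be exposed as a table; a minimal sketch using a temporary view (my_table and the column names are placeholders):
# Register the DataFrame as a temporary view so SQL can reference it
spark_df = spark.createDataFrame(df)
spark_df.createOrReplaceTempView("my_table")
# Aggregate over the view; Column1 is a placeholder column name
spark.sql("SELECT Column1, COUNT(*) AS n FROM my_table GROUP BY Column1").show()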