# Administrative Functions Reference
This page provides detailed reference documentation for all administrative functions available in the Micromegas admin module.
## Table Functions (UDTFs)

### list_view_sets()
Description: Lists all available view sets with their current schema information.
Usage:
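```sql
SELECT * FROM list_view_sets();
```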
Returns: Table with columns:
| Column | Type | Description |
|---|---|---|
| `view_set_name` | String | Name of the view set (e.g., `"log_entries"`, `"measures"`) |
| `current_schema_hash` | Binary | Version identifier for the current schema (e.g., `[4]`) |
| `schema` | String | Full schema as a formatted string |
| `has_view_maker` | Boolean | Whether the view set supports non-global instances |
| `global_instance_available` | Boolean | Whether a global instance exists |
Example:
```sql
-- List all view sets with schema versions
SELECT view_set_name, current_schema_hash, has_view_maker
FROM list_view_sets()
ORDER BY view_set_name;

-- Find view sets with specific schema version
SELECT * FROM list_view_sets()
WHERE current_schema_hash = '[4]';
```
Performance: Fast operation; queries the in-memory view registry.

### list_partitions()
Description: Lists all partitions in the lakehouse with metadata.
Usage:
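```sql
SELECT * FROM list_partitions();
```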
Returns: Table with columns including:
| Column | Type | Description |
|---|---|---|
| `view_set_name` | String | View set name |
| `view_instance_id` | String | Instance ID or `'global'` |
| `begin_insert_time` | Timestamp | Partition start time |
| `end_insert_time` | Timestamp | Partition end time |
| `min_event_time` | Timestamp | Earliest event time in the partition |
| `max_event_time` | Timestamp | Latest event time in the partition |
| `updated` | Timestamp | Last update time |
| `file_path` | String | Object storage file path |
| `file_size` | Integer | File size in bytes |
| `file_schema_hash` | Binary | Schema version when the partition was created |
| `source_data_hash` | Binary | Hash of the source data |
| `num_rows` | Integer | Number of rows in the partition |
Example:
```sql
-- List partitions for specific view set
SELECT file_path, file_size, file_schema_hash
FROM list_partitions()
WHERE view_set_name = 'log_entries';

-- Find partitions by schema version
SELECT view_set_name, COUNT(*) as partition_count
FROM list_partitions()
WHERE file_schema_hash = '[3]'
GROUP BY view_set_name;
```
Performance: Queries the database metadata table, which is indexed by `view_set_name`.

### retire_partitions(view_set, view_instance, start_time, end_time)
Description: Retires partitions within a specified time range.
Parameters:
- `view_set` (String): Target view set name
- `view_instance` (String): Target view instance ID
- `start_time` (Timestamp): Start of time range (inclusive)
- `end_time` (Timestamp): End of time range (inclusive)
Usage:
```sql
SELECT * FROM retire_partitions('log_entries', 'process-123', '2024-01-01T00:00:00Z', '2024-01-02T00:00:00Z');
```
Returns: Table with retirement operation results.
Safety: Uses database transactions for atomicity. All partitions in the time range are retired.

> **Time-Based Retirement**
> This function retires ALL partitions in the specified time range, regardless of schema compatibility. Use with caution.
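Because every partition in the range is retired, it can help to preview what will be affected first. A minimal sketch using the documented `list_partitions()` columns:

```sql
-- Preview the partitions that fall in the range before retiring them
SELECT file_path, file_size, num_rows
FROM list_partitions()
WHERE view_set_name = 'log_entries'
  AND view_instance_id = 'process-123'
  AND begin_insert_time >= TIMESTAMP '2024-01-01 00:00:00'
  AND end_insert_time <= TIMESTAMP '2024-01-02 00:00:00';
```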
## Scalar Functions (UDFs)

### delete_duplicate_processes()
Description: Deletes duplicate processes within the query time range. A duplicate is defined as multiple rows sharing the same `process_id`; the function keeps the row with the earliest `insert_time` and deletes all others.
Parameters: None (the time range is set via the Python client)
Returns: String message (e.g., `"Deleted 42 duplicate processes"`)
> **Destructive Operation**
> This function permanently deletes data. The time range only identifies which IDs have duplicates; all duplicate rows for those IDs (except the earliest, which is kept) are then deleted regardless of their `insert_time`.
### delete_duplicate_streams()

Description: Deletes duplicate streams within the query time range. A duplicate is defined as multiple rows sharing the same `stream_id`; the function keeps the row with the earliest `insert_time` and deletes all others.
Parameters: None (the time range is set via the Python client)
Returns: String message (e.g., `"Deleted 15 duplicate streams"`)
> **Destructive Operation**
> This function permanently deletes data. The time range only identifies which IDs have duplicates; all duplicate rows for those IDs (except the earliest, which is kept) are then deleted regardless of their `insert_time`.
### delete_duplicate_blocks()

Description: Deletes duplicate blocks within the query time range. A duplicate is defined as multiple rows sharing the same `block_id`; the function keeps the row with the earliest `insert_time` and deletes all others.
Parameters: None (the time range is set via the Python client)
Returns: String message (e.g., `"Deleted 100 duplicate blocks"`)
> **Destructive Operation**
> This function permanently deletes data. The time range only identifies which IDs have duplicates; all duplicate rows for those IDs (except the earliest, which is kept) are then deleted regardless of their `insert_time`.
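Since the duplicate-identifying time range comes from the query itself, these functions are typically invoked from the Python client. A minimal sketch, assuming the client's `query(sql, begin, end)` form used elsewhere in the Micromegas documentation:

```python
import datetime
import micromegas

client = micromegas.connect()

# The begin/end arguments set the query time range that identifies duplicates
end = datetime.datetime.now(datetime.timezone.utc)
begin = end - datetime.timedelta(days=7)

# Each call returns a single row holding the status message,
# e.g. "Deleted 42 duplicate processes"
for fn in ("delete_duplicate_processes",
           "delete_duplicate_streams",
           "delete_duplicate_blocks"):
    df = client.query(f"SELECT {fn}();", begin, end)
    print(df.iloc[0, 0])
```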
### retire_partition_by_metadata(view_set_name, view_instance_id, begin_insert_time, end_insert_time)

Description: Surgically retires a single partition by its metadata identifiers. This is the preferred method for retiring partitions, as it works for both empty partitions (`file_path=NULL`) and non-empty partitions.
Parameters:
- `view_set_name` (String): Name of the view set
- `view_instance_id` (String): Instance ID (e.g., a process_id or `'global'`)
- `begin_insert_time` (Timestamp): Begin insert time of the partition
- `end_insert_time` (Timestamp): End insert time of the partition
Usage:
```sql
SELECT retire_partition_by_metadata(
    'log_entries',
    'process-123',
    TIMESTAMP '2024-01-01 00:00:00',
    TIMESTAMP '2024-01-01 01:00:00'
) as result;
```
Returns: String message indicating success or failure:
- Success: `"SUCCESS: Retired partition <view_set>/<instance> [<begin>, <end>)"`
- Failure: `"ERROR: Partition not found: <view_set>/<instance> [<begin>, <end>)"`
Safety:
- Surgical precision: only targets the exact specified partition by its natural identifiers
- Works for both empty partitions (`file_path=NULL`) and non-empty partitions
- Uses database transactions with automatic rollback on batch errors
- Files are scheduled for cleanup rather than immediately deleted
Example:
```sql
-- Retire specific partition by metadata
SELECT retire_partition_by_metadata(
    'log_entries',
    'process-123',
    TIMESTAMP '2024-01-01 00:00:00',
    TIMESTAMP '2024-01-01 01:00:00'
);

-- Batch retire incompatible partitions
-- (columns are qualified with p. since both tables have view_set_name)
SELECT
    p.view_set_name,
    p.view_instance_id,
    retire_partition_by_metadata(
        p.view_set_name,
        p.view_instance_id,
        p.begin_insert_time,
        p.end_insert_time
    ) as result
FROM list_partitions() p
JOIN list_view_sets() vs ON p.view_set_name = vs.view_set_name
WHERE p.file_schema_hash != vs.current_schema_hash
LIMIT 10;
```
Batch Behavior: When called with multiple rows in a single query, all operations are executed within a single database transaction. If any retirement fails, all changes are rolled back and a `ROLLED_BACK` message is appended indicating the number of reverted changes.
Performance: Single-partition operation; very fast with appropriate database indexes.
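To verify batch outcomes from the Python client, the returned messages can be checked for errors. A minimal sketch, assuming the client's `query(sql, begin, end)` form:

```python
import datetime
import micromegas

client = micromegas.connect()

# Same batch retirement as the SQL example above
batch_sql = """
SELECT retire_partition_by_metadata(
        p.view_set_name, p.view_instance_id,
        p.begin_insert_time, p.end_insert_time) AS result
FROM list_partitions() p
JOIN list_view_sets() vs ON p.view_set_name = vs.view_set_name
WHERE p.file_schema_hash != vs.current_schema_hash
LIMIT 10;
"""

end = datetime.datetime.now(datetime.timezone.utc)
begin = end - datetime.timedelta(days=30)
df = client.query(batch_sql, begin, end)

# Any ERROR row means the whole transaction was rolled back; the failing
# message carries an appended ROLLED_BACK note with the revert count
for msg in df["result"]:
    if msg.startswith("ERROR"):
        print(msg)
```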
### retire_partition_by_file(file_path)
Description: Retires a single partition by exact file path match.
> **Prefer metadata-based retirement**
> For new code, prefer `retire_partition_by_metadata()`, which works for both empty and non-empty partitions.
Parameters:
- `file_path` (String): Exact file path of the partition to retire
Usage:
```sql
SELECT retire_partition_by_file('s3://bucket/data/log_entries/process-123/2024/01/01/file.parquet') as result;
```
Returns: String message indicating success or failure:
- Success: `"SUCCESS: Retired partition <file_path>"`
- Failure: `"ERROR: Partition not found: <file_path>"`
Limitation: Cannot retire empty partitions (where `file_path` is NULL). Use `retire_partition_by_metadata()` for empty partitions.
Performance: Single-partition operation; very fast with appropriate database indexes.
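Because an exact path is required, the function pairs naturally with `list_partitions()`. A sketch with illustrative filters:

```sql
-- Retire non-empty partitions for one view set by their exact file paths
SELECT file_path, retire_partition_by_file(file_path) as result
FROM list_partitions()
WHERE view_set_name = 'log_entries'
  AND file_path IS NOT NULL
LIMIT 5;
```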
## Python API Functions

### micromegas.admin.list_incompatible_partitions(client, view_set_name=None)
Description: Identifies partitions with schemas incompatible with current schema versions. Returns one row per incompatible partition for precise targeting.
Parameters:
- `client` (FlightSQLClient): Connected Micromegas client
- `view_set_name` (str, optional): Filter to a specific view set
Returns: pandas DataFrame with columns:
| Column | Type | Description |
|---|---|---|
| `view_set_name` | str | View set name |
| `view_instance_id` | str | Instance ID |
| `begin_insert_time` | timestamp | Begin insert time of the partition |
| `end_insert_time` | timestamp | End insert time of the partition |
| `incompatible_schema_hash` | str | Old schema version in the partition |
| `current_schema_hash` | str | Current schema version |
| `file_path` | str | File path for the partition (NULL for empty partitions) |
| `file_size` | int | Size in bytes of the partition file (0 for empty partitions) |
Example:
```python
import micromegas
import micromegas.admin

client = micromegas.connect()

# List all incompatible partitions
incompatible = micromegas.admin.list_incompatible_partitions(client)
print(f"Found {len(incompatible)} incompatible partitions")

# List for a specific view set
log_incompatible = micromegas.admin.list_incompatible_partitions(client, 'log_entries')
print(f"Log entries: {len(log_incompatible)} incompatible partitions")
print(f"Total size: {log_incompatible['file_size'].sum()} bytes")

# Check for empty partitions (file_path is NULL)
empty_partitions = incompatible[incompatible['file_path'].isna()]
print(f"Empty partitions: {len(empty_partitions)}")
```
Implementation: Uses a SQL JOIN between `list_partitions()` and `list_view_sets()` with server-side filtering.
Performance: Efficient server-side processing, minimal network overhead.
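Conceptually, the helper issues a query similar to the following sketch (the exact SQL it generates may differ):

```sql
SELECT
    p.view_set_name,
    p.view_instance_id,
    p.begin_insert_time,
    p.end_insert_time,
    p.file_schema_hash as incompatible_schema_hash,
    vs.current_schema_hash,
    p.file_path,
    p.file_size
FROM list_partitions() p
JOIN list_view_sets() vs ON p.view_set_name = vs.view_set_name
WHERE p.file_schema_hash != vs.current_schema_hash;
```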
### micromegas.admin.retire_incompatible_partitions(client, view_set_name=None)

Description: Safely retires partitions with incompatible schemas using metadata-based retirement. This handles both empty partitions (`file_path=NULL`) and non-empty partitions.
Parameters:
- `client` (FlightSQLClient): Connected Micromegas client
- `view_set_name` (str, optional): Filter to a specific view set
Returns: pandas DataFrame with columns:
| Column | Type | Description |
|---|---|---|
| `view_set_name` | str | View set processed |
| `view_instance_id` | str | Instance ID processed |
| `partitions_retired` | int | Count of successfully retired partitions |
| `partitions_failed` | int | Count of partitions that failed to retire |
| `storage_freed_bytes` | int | Total bytes freed from storage |
| `retirement_messages` | list | Detailed messages for each retirement attempt |
Example:
```python
import micromegas
import micromegas.admin

client = micromegas.connect()

# Preview what would be retired
preview = micromegas.admin.list_incompatible_partitions(client, 'log_entries')
print(f"Will retire {len(preview)} partitions")
print(f"Will free {preview['file_size'].sum() / (1024**3):.2f} GB")

# Retire incompatible partitions
result = micromegas.admin.retire_incompatible_partitions(client, 'log_entries')
for _, row in result.iterrows():
    print(f"Retired {row['partitions_retired']} partitions from {row['view_set_name']}")
    print(f"Failed {row['partitions_failed']} partitions")
    print(f"Freed {row['storage_freed_bytes']} bytes")
```
Safety Features:
- Uses `retire_partition_by_metadata()` for surgical precision
- Works for both empty partitions (`file_path=NULL`) and non-empty partitions
- Cannot accidentally retire compatible partitions
- Comprehensive error handling with detailed messages
- Continues processing even if individual partitions fail
- Results grouped by view_set_name and view_instance_id for clarity
Implementation:
1. Calls `list_incompatible_partitions()` to identify targets (one row per partition)
2. Groups partitions by `view_set_name` and `view_instance_id`
3. For each partition, calls `retire_partition_by_metadata()` with the partition's natural identifiers
4. Aggregates results and provides summary statistics per group
5. Includes detailed operation logs for auditing
Performance: Processes partitions individually for safety, efficient for typical partition counts.
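Continuing from the example above, the `retirement_messages` column can be scanned to audit individual failures:

```python
# Surface detailed messages for any group that had failed retirements
failed = result[result["partitions_failed"] > 0]
for _, row in failed.iterrows():
    print(f"{row['view_set_name']}/{row['view_instance_id']}:")
    for msg in row["retirement_messages"]:
        if msg.startswith("ERROR"):
            print("  " + msg)
```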
## Complex Query Examples

### Find Schema Migration Candidates
```sql
-- Identify view sets with the most incompatible partitions
SELECT
    vs.view_set_name,
    vs.current_schema_hash,
    COUNT(DISTINCT p.file_schema_hash) as schema_versions_count,
    SUM(CASE WHEN p.file_schema_hash != vs.current_schema_hash THEN 1 ELSE 0 END) as incompatible_count,
    SUM(CASE WHEN p.file_schema_hash != vs.current_schema_hash THEN p.file_size ELSE 0 END) as incompatible_size_bytes
FROM list_view_sets() vs
LEFT JOIN list_partitions() p ON vs.view_set_name = p.view_set_name
GROUP BY vs.view_set_name, vs.current_schema_hash
-- Repeat the aggregate in HAVING; select aliases are not portable here
HAVING SUM(CASE WHEN p.file_schema_hash != vs.current_schema_hash THEN 1 ELSE 0 END) > 0
ORDER BY incompatible_size_bytes DESC;
```
### Analyze Partition Age Distribution
```sql
-- Find old incompatible partitions that are candidates for retirement
SELECT
    p.view_set_name,
    p.file_schema_hash as old_schema,
    vs.current_schema_hash,
    COUNT(*) as partition_count,
    MIN(p.end_insert_time) as oldest_partition,
    MAX(p.end_insert_time) as newest_partition,
    SUM(p.file_size) as total_size_bytes
FROM list_partitions() p
JOIN list_view_sets() vs ON p.view_set_name = vs.view_set_name
WHERE p.file_schema_hash != vs.current_schema_hash
    AND p.end_insert_time < NOW() - INTERVAL '30 days'
GROUP BY p.view_set_name, p.file_schema_hash, vs.current_schema_hash
ORDER BY oldest_partition ASC;
```
### Storage Impact Analysis
```sql
-- Calculate storage savings from retiring incompatible partitions
WITH incompatible_summary AS (
    SELECT
        p.view_set_name,
        COUNT(*) as incompatible_partitions,
        SUM(p.file_size) as incompatible_size_bytes
    FROM list_partitions() p
    JOIN list_view_sets() vs ON p.view_set_name = vs.view_set_name
    WHERE p.file_schema_hash != vs.current_schema_hash
    GROUP BY p.view_set_name
),
total_summary AS (
    SELECT
        view_set_name,
        COUNT(*) as total_partitions,
        SUM(file_size) as total_size_bytes
    FROM list_partitions()
    GROUP BY view_set_name
)
SELECT
    t.view_set_name,
    COALESCE(i.incompatible_partitions, 0) as incompatible_partitions,
    t.total_partitions,
    ROUND(100.0 * COALESCE(i.incompatible_partitions, 0) / t.total_partitions, 2) as incompatible_percentage,
    COALESCE(i.incompatible_size_bytes, 0) as incompatible_size_bytes,
    t.total_size_bytes,
    ROUND(100.0 * COALESCE(i.incompatible_size_bytes, 0) / t.total_size_bytes, 2) as size_percentage
FROM total_summary t
LEFT JOIN incompatible_summary i ON t.view_set_name = i.view_set_name
ORDER BY size_percentage DESC;
```