Micromegas - Scalable Observability

Rust API documentation
Python API
Grafana plugin
Design presentation
Unreal observability
Objectives
- Unified observability: logs, metrics and traces in the same database.
- Spend less time reproducing problems.
- Achieve better quality: monitor & catch problems before they get noticed by users.
Design Strategies
Low overhead instrumentation
20 ns / event in the calling thread, one additional thread for the preparation and upload to the server.
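The split between the calling thread and the upload thread can be illustrated with a minimal sketch. This is not the Micromegas implementation: `EventSink`, `record`, and the `upload` callable are hypothetical names, and Python will not get anywhere near 20 ns per event. The point is only the division of labor: the instrumented thread enqueues, and a single background thread batches and sends.

```python
import queue
import threading


class EventSink:
    """Hypothetical sink: callers only enqueue; one worker thread batches and uploads."""

    def __init__(self, upload, batch_size=3):
        self._queue = queue.SimpleQueue()
        self._upload = upload  # hypothetical callable standing in for the upload to the server
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, event):
        # Hot path: a single enqueue, no serialization or I/O in the calling thread.
        self._queue.put(event)

    def close(self):
        # A None sentinel asks the worker to flush what is left and stop.
        self._queue.put(None)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            event = self._queue.get()
            if event is None:
                break
            batch.append(event)
            if len(batch) >= self._batch_size:
                self._upload(batch)
                batch = []
        if batch:
            self._upload(batch)


# Usage: collect "uploaded" batches in a list instead of sending them anywhere.
uploaded = []
sink = EventSink(upload=uploaded.append, batch_size=3)
for i in range(7):
    sink.record({"name": "frame_time_ms", "value": i})
sink.close()
```

Keeping all preparation work off the calling thread is what makes a small, predictable per-event cost possible; the trade-off is that events sit in memory until the worker gets to them.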
High frequency of events
Up to 100,000 events / second for a single instrumented process.
Scalability of ingestion service
Scalable backend can accept data from millions of concurrent instrumented processes.
- Data stored in S3
- Metadata stored in PostgreSQL (https://www.postgresql.org/)
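As a rough illustration of this split between bulk data and metadata, the sketch below writes each payload to an object store and records only a small row describing it. Everything here is an assumption made for the example: a dict stands in for the S3 bucket, an in-memory SQLite database stands in for PostgreSQL, and the `blocks` table, its columns, and the key layout are hypothetical rather than the actual Micromegas schema.

```python
import sqlite3
import uuid

object_store = {}  # stand-in for an S3 bucket: object key -> bytes
metadata_db = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
metadata_db.execute(
    "CREATE TABLE blocks ("
    "block_id TEXT PRIMARY KEY, process_id TEXT, object_key TEXT, size_bytes INTEGER)"
)


def ingest_block(process_id, payload):
    """Store the payload bytes as an object and index them with one metadata row."""
    block_id = str(uuid.uuid4())
    object_key = f"blobs/{process_id}/{block_id}"  # hypothetical key layout
    object_store[object_key] = payload
    metadata_db.execute(
        "INSERT INTO blocks VALUES (?, ?, ?, ?)",
        (block_id, process_id, object_key, len(payload)),
    )
    metadata_db.commit()
    return block_id


def blocks_of_process(process_id):
    """Find a process's payloads through the metadata, then fetch the objects."""
    rows = metadata_db.execute(
        "SELECT object_key FROM blocks WHERE process_id = ? ORDER BY rowid",
        (process_id,),
    ).fetchall()
    return [object_store[key] for (key,) in rows]


ingest_block("proc-a", b"first block")
ingest_block("proc-b", b"other process")
ingest_block("proc-a", b"second block")
payloads_a = blocks_of_process("proc-a")
```

The ingestion path never has to read or interpret the payload, so it can be scaled horizontally; the relational database only holds small rows that make the objects findable.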
Tail sampling & ETL on demand
In order to keep costs down, most payloads will remain unprocessed until they expire.
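A minimal sketch of this idea, assuming a hypothetical `PayloadStore` with JSON payloads and a fixed time-to-live (none of these names or formats come from Micromegas): ingestion only stores bytes, decoding happens the first time a query asks for a payload, and expiry discards whatever was never looked at.

```python
import json


class PayloadStore:
    """Hypothetical store: payloads stay raw until a query needs them or they expire."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._raw = {}  # payload_id -> (received_at, raw bytes)
        self._processed = {}  # payload_id -> decoded events, filled lazily
        self.decode_count = 0

    def ingest(self, payload_id, raw, now):
        # Ingestion is cheap: keep the bytes, do no parsing.
        self._raw[payload_id] = (now, raw)

    def query(self, payload_id):
        # ETL on demand: decode the first time someone asks, then reuse the result.
        if payload_id not in self._processed:
            _, raw = self._raw[payload_id]
            self._processed[payload_id] = json.loads(raw)
            self.decode_count += 1
        return self._processed[payload_id]

    def expire(self, now):
        # Drop payloads older than the TTL, whether or not they were ever decoded.
        expired = [pid for pid, (t, _) in self._raw.items() if now - t >= self._ttl]
        for pid in expired:
            del self._raw[pid]
            self._processed.pop(pid, None)
        return expired


store = PayloadStore(ttl_seconds=60)
store.ingest("healthy-run", b'[{"msg": "ok"}]', now=0)
store.ingest("crashed-run", b'[{"msg": "fatal error"}]', now=10)

# Only the run someone is investigating gets decoded, and only once.
events = store.query("crashed-run")
store.query("crashed-run")
expired = store.expire(now=65)
```

The decision about which data deserves processing is made after the fact (the "tail" in tail sampling), so the cost of transformation is paid only for the payloads that turn out to matter.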
Query using SQL
- Analytics built on Apache DataFusion (https://arrow.apache.org/datafusion/)
Status
February 2025
- Released version 0.4.0
- Incremental data reduction using SQL-defined views
- System monitor thread
- Added support for ARM (& macOS)
- Deleted analytics-srv and the custom HTTP Python client used to connect to it
January 2025
- Released version 0.3.0
- New FlightSQL Python API
- Ready to replace analytics-srv with flight-sql-srv
December 2024
November 2024
Released version 0.2.1
- FlightSQL support
- Measures and log entries can now be tagged with properties
- Not yet available in SQL queries
October 2024
Released version 0.2.0
- Unified the query interface
- Using the view_instance table function to materialize just-in-time process-specific views from within SQL
- Updated Python doc to reflect the new API: https://pypi.org/project/micromegas/
September 2024
Released version 0.1.9
- Updating global views every second
- Caching metadata (processes, streams & blocks) in the lakehouse & allowing SQL queries on them
August 2024
Released version 0.1.7
- New global materialized views for logs & metrics of all processes
- New daemon service to keep the views updated as data is ingested
- New analytics API based on SQL, powered by Apache DataFusion
July 2024
Released version 0.1.5
Unreal
- Better reliability, retrying failed HTTP requests
- Spike detection
Maintenance
- Delete old blocks, streams & processes using a cron task
June 2024
Released version 0.1.4
Good enough for dogfooding :)
Unreal
- Metrics publisher
- FName scopes
Analytics
- Metric queries
- Convert CPU traces to Perfetto format
May 2024
Released version 0.1.3
Better Unreal Engine instrumentation
- new protocol
- HTTP request callbacks no longer bound to the main thread
- custom authentication of requests
Analytics
- query process metadata
- query spans of a thread
April 2024
Telemetry ingestion from Rust & Unreal is working :)
Released version 0.1.1
Not actually useful yet, I need to bring back the analytics service to a working state.
January 2024
Starting anew. I’m extracting the tracing/telemetry/analytics code from https://github.com/legion-labs/legion to jumpstart the new project. If you are interested in collaborating, please reach out.