Scheduler, AKA Dagster
About
The following describes the steps and requirements for building and deploying the Docker-based workflow implemented in Dagster.
Overview
The following image provides a broad overview of the elements that are loaded into the Docker orchestration environment. This is a very basic view and doesn't show any scaling or failover elements.
The key elements are:
- source configurations to load into the Gleaner and Nabu tools and push to the triplestore. These are now stored in an s3 location
- gleaner configuration: a list of sources to load. (NOTE: This is also a Docker config that needs to be updated to match in order for things to work.)
- tenant configuration: a list of communities and which sources they load
- nabu configuration
- The Dagster set, which loads three containers to support workflow operations
- The Gleaner architecture images, which load three or more containers to support:
  - s3 object storage
  - graph database (triplestore)
  - headless chrome for page rendering to support dynamically inserted JSON-LD
  - any other support packages such as text, semantic, or spatial indexes
WORKFLOWS
There are three workflows:
- ingest: loads sources
- tasks: weekly tasks
- custom (ecrr): loads the EarthCube Resource Registry
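As a rough, hypothetical sketch (not the repository's actual code), each workflow is a Dagster job built from ops, with a schedule or sensor attached; the names below are illustrative only:

```python
# Hypothetical sketch: a Dagster job with a weekly schedule.
# "summon_sources", "release_to_graph", and "ingest_job" are illustrative names,
# not the project's real definitions.
from dagster import Definitions, ScheduleDefinition, job, op

@op
def summon_sources():
    """Placeholder for a Gleaner summon step."""

@op
def release_to_graph(summoned):
    """Placeholder for a Nabu release/load step."""

@job
def ingest_job():
    release_to_graph(summon_sources())

weekly_ingest_schedule = ScheduleDefinition(job=ingest_job, cron_schedule="@weekly")

defs = Definitions(jobs=[ingest_job], schedules=[weekly_ingest_schedule])
```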
basic deployment
- information for environment variables is created
- the configuration files are created and loaded to S3 and to docker/config
- a Docker stack for dagster/scheduler is created, and the environment variables are added
- Portainer deploys the stack
- a Docker stack for an ingest service is created, and the environment variables are added
- Portainer deploys the stack
- initial configuration jobs for ingest and tasks are executed; they read the gleaner and tenant configurations
- when complete, they request loading runs for the sources from Gleaner
- when a loading run is complete, a sensor triggers, and a release is loaded to a tenant
Ingest Workflow
Task workflows
Steps to build and deploy
The deployment can be developed locally. You can run jobs and materialize assets from the command line.
You can set up a services stack in Docker to test locally, or use existing services.
The production containers (dagster, gleaner, and nabu) are built with a GitHub Action. You can also use a makefile.
This describes the local and container deployment. We use Portainer to manage our Docker deployments.
Server Deployment.
Production example for Earthcube
DEVELOPER: PyCharm -- run local with remote services
You can test components in PyCharm. Run configurations for PyCharm are in runConfigurations (TODO: instructions).
Use the EnvFile plugin.
- move to the implnets/deployment directory
- copy envFile.env to .env (see the EnvFile plugin)
- edit the entries to point at a Portainer/Traefik instance with running services
- edit the configuration files in implnets/configs/PROJECT: gleanerconfig.yaml, tenant.yaml
- upload the configuration files from implnets/configs/PROJECT to s3 scheduler/configs: gleanerconfig.yaml, tenant.yaml
- run a PyCharm run configuration, e.g. dagster_ingest_debug
- go to http://localhost:3000/
- you can test the schedules
Full stack test -- run local with remote services
- move to the implnets/deployment directory
- copy envFile.env to .env (see the EnvFile plugin)
- edit the entries
- edit the configuration files in configs/PROJECT: gleanerconfig.yaml, tenant.yaml
- upload the configuration files from configs/PROJECT to s3 scheduler/configs: gleanerconfig.yaml, tenant.yaml
- for local:
./dagster_localrun.sh
- go to http://localhost:3000/
To deploy in Portainer, use the deployment/compose_project.yaml docker stack.
Docker compose configuration
There are configuration files that are needed. They are installed in two places:
- as Docker configs
- as scheduler configs in S3
(NOTE: I think the configs are still needed in the containers)
file | local | installed as | note
---|---|---|---
workspace.yaml | configs/local/workspace.yaml | dockerconfig: workspace | docker compose: used by dagster
gleanerconfig.yaml | configs/PROJECT/gleanerconfig.yaml | s3:{bucket}/scheduler/configs/gleanerconfig.yaml | ingest workflow; needs to be in minio/s3
tenant.yaml | configs/PROJECT/tenant.yaml | s3:{bucket}/scheduler/configs/tenant.yaml | ingest workflow; needs to be in minio/s3
dagster.yaml | dagster/implnets/deployment/dagster.yaml | dockerconfig: dagster | docker compose: used by dagster
gleanerconfig.yaml | configs/PROJECT/gleanerconfig.yaml | read from s3 url by gleaner |
nabuconfig.yaml | configs/PROJECT/nabuconfig.yaml | read from s3 url by nabu |
(NOTE: This is also a gleaner config (below in runtime configuration) that needs to be updated to match in order for things to work.)
Docker Configs for gleanerio containers are still needed:
file | local | stack | note
---|---|---|---
gleanerconfig.yaml | configs/PROJECT/gleanerconfig.yaml | env () |
nabuconfig.yaml | configs/PROJECT/nabuconfig.yaml | env () |
- when the containers are running in a stack on Portainer, they will need to be updated by pulling from Docker Hub. The ENV variables may need to be updated for the CONTAINER*_TAG values.
Runtime configuration
Upload to an S3 bucket:
file | s3 location | note
---|---|---
gleanerconfig.yaml | s3:{bucket}/scheduler/configs/gleanerconfig.yaml | ingest workflow; needs to be in minio/s3
nabuconfig.yaml | s3:{bucket}/scheduler/configs/nabuconfig.yaml | ingest workflow; needs to be in minio/s3
tenant.yaml | s3:{bucket}/scheduler/configs/tenant.yaml | ingest workflow; needs to be in minio/s3
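A hedged example of the upload step, using the MinIO Python client with the GLEANERIO_MINIO_* values from the environment file; the local configs/PROJECT/ paths are assumptions, so adjust them for your project:

```python
# Sketch: push the runtime configs to s3:{bucket}/scheduler/configs/.
# Endpoint, credentials, and bucket come from the GLEANERIO_MINIO_* env vars.
import os
from minio import Minio

client = Minio(
    f"{os.environ['GLEANERIO_MINIO_ADDRESS']}:{os.environ['GLEANERIO_MINIO_PORT']}",
    access_key=os.environ["GLEANERIO_MINIO_ACCESS_KEY"],
    secret_key=os.environ["GLEANERIO_MINIO_SECRET_KEY"],
    secure=os.environ.get("GLEANERIO_MINIO_USE_SSL", "false").lower() == "true",
)

bucket = os.environ["GLEANERIO_MINIO_BUCKET"]
for name in ("gleanerconfig.yaml", "nabuconfig.yaml", "tenant.yaml"):
    client.fput_object(bucket, f"scheduler/configs/{name}", f"configs/PROJECT/{name}")
```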
updating config
You can update a config, and a sensor should pick up the changes.
1. Upload the changed file to S3. (Note: if this is a new source, you also need to add it to the Docker config, gleaner-PROJECT.)
2. Go to Overview.
3. Go to s3_config_source_sensor for gleanerconfig.yaml changes, or s3_config_tenant_sensor for tenant.yaml changes.
4. At some point, a run should occur.
5. Then go to the sources_sensor or the tenant sensor.

If the job does not run, you can do a backfill.
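Conceptually, these config sensors behave like the following sketch: a Dagster sensor polls the object in S3 and requests a run when it changes. This is illustrative only; the job, op, and get_object_etag helper are placeholders, not the project's actual code.

```python
# Illustrative sensor: trigger a run when gleanerconfig.yaml changes in S3.
from dagster import RunRequest, SkipReason, job, op, sensor

@op
def reload_config():
    """Placeholder op representing the config-driven work."""

@job
def ingest_job():
    reload_config()

def get_object_etag(key: str) -> str:
    """Hypothetical helper that returns the S3 object's current ETag."""
    raise NotImplementedError

@sensor(job=ingest_job)
def s3_config_source_sensor(context):
    etag = get_object_etag("scheduler/configs/gleanerconfig.yaml")
    if etag == context.cursor:
        return SkipReason("gleanerconfig.yaml unchanged")
    context.update_cursor(etag)
    return RunRequest(run_key=etag)
```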
new sources:
- go to the Jobs tab, and run summon_and_release with the 'partitions' (aka 'sources') that are recent
- click materialize_all, and in the backfill dialog be sure only the added partition is selected
- go to Runs, and see that a job with a partition of that name is queued/running
- run tenant_release_job with the same partition name to load data to the tenants
new tenants:
There are two jobs that need to run to move data to a tenant. (A third will be needed for the UI.)
1. Go to the Jobs tab, and run tenant_namespaces_job with the 'partitions' (aka 'tenant') that are recent.
2. Click materialize_all, and be sure only the added partition is selected.
3. Go to Runs, and see that a job with a partition of that name is queued/running.
4. Go to the Jobs tab, and run tenant_release_job with the 'partitions' (aka 'sources') for that tenant.
5. Click materialize_all. The data will be pushed to all tenant namespaces.
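In Dagster terms, sources and tenants are partitions of the assets/jobs, which is why a single new source or tenant can be materialized or backfilled on its own. A minimal, hypothetical illustration of the idea (the partition names are made up):

```python
# Sketch: sources as static partitions; materialize only the new one.
from dagster import StaticPartitionsDefinition, asset, materialize

sources = StaticPartitionsDefinition(["existing_source", "new_source"])

@asset(partitions_def=sources)
def summoned_source(context):
    context.log.info(f"harvesting {context.partition_key}")

# Equivalent to selecting just the added partition in the backfill dialog.
materialize([summoned_source], partition_key="new_source")
```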
test schedules
Environment files
- cp deployment/envFile.env .env
- edit
export $(cat .env | xargs)
Example .env contents:

    # DAGSTER_ FOR LOCAL DEVELOPMENT
    DAGSTER_HOME=dagster/dagster_home
    DAGSTER_LOCAL_ARTIFACT_STORAGE_DIR=/Users/valentin/development/dev_earthcube/scheduler/dagster/dagster_home/
    # dagster network and volume
    #GLEANERIO_DAGSTER_STORAGE=dagster_storage
    #GLEANERIO_DAGSTER_NETWORK=dagster_host
    ## PROJECT -- default 'eco'. this is a 'TRAEFIK router name' used to run multiple copies of scheduler on a server
    # originally used to generate code for a specific project
    #PROJECT=test
    #PROJECT=eco
    #PROJECT=iow
    #PROJECT=oih
    ####
    # workspace for dagster
    ####
    GLEANERIO_WORKSPACE_CONFIG_PATH=/usr/src/app/workspace.yaml
    GLEANERIO_DOCKER_WORKSPACE_CONFIG=workspace-eco
    GLEANERIO_DOCKER_DAGSTER_CONFIG=dagster
    DEBUG_CONTAINER=false
    #### HOST
    # host base name for traefik. fixed to localhost:3000 when using compose_local.
    HOST=localhost
    # Applies only to compose_project.yaml runs
    # modify SCHED_HOSTNAME if you want to run more than one instance
    # aka two different project harvests for now.
    SCHED_HOSTNAME=sched
    GLEANERIO_DOCKER_CONTAINER_WAIT_TIMEOUT=300 # debugging: set to 10 - 30 seconds
    # DEFAULT SCHEDULE
    # as defined by https://docs.dagster.io/concepts/partitions-schedules-sensors/schedules#basic-schedules
    # "@hourly", "@daily", "@weekly", and "@monthly"
    #GLEANERIO_DEFAULT_SCHEDULE=@weekly
    #GLEANERIO_DEFAULT_SCHEDULE_TIMEZONE=America/Los_Angeles
    # the above are used as hard coded os.getenv(), so when changed, the service needs to be restarted.
    # tags for docker compose
    CONTAINER_CODE_TAG=latest
    CONTAINER_DAGSTER_TAG=latest
    PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
    # port is required: https://portainer.{HOST}:443/api/endpoints/9/docker/
    # 9 is dataloader, 2 is aws-dev
    GLEANERIO_DOCKER_URL=https://portainer.{HOST}:443/api/endpoints/9/docker/
    GLEANERIO_PORTAINER_APIKEY=
    # if running dagster-dev, then this needs to be set;
    # defaults to "/scheduler/gleanerconfig.yaml" which is the path to the config mounted in containers
    # when debugging generated code "../../../configs/eco/gleanerconfig.yaml"
    # when debugging code in workflows "../../configs/eco/gleanerconfig.yaml"
    GLEANERIO_DAGSTER_CONFIG_PATH=../../../configs/eco/gleanerconfig.yaml
    # Network
    GLEANERIO_DOCKER_HEADLESS_NETWORK=headless_gleanerio
    ### GLEANER/NABU Dockers
    GLEANERIO_GLEANER_IMAGE=nsfearthcube/gleaner:dev_ec
    GLEANERIO_NABU_IMAGE=nsfearthcube/nabu:dev_eco
    ###
    # path in s3 for docker log files
    GLEANERIO_LOG_PREFIX=scheduler/logs/
    GLEANERIO_MINIO_ADDRESS=
    GLEANERIO_MINIO_PORT=80
    GLEANERIO_MINIO_USE_SSL=false
    GLEANERIO_MINIO_BUCKET=
    GLEANERIO_MINIO_ACCESS_KEY=
    GLEANERIO_MINIO_SECRET_KEY=
    #
    # where are the gleaner and tenant configurations
    GLEANERIO_CONFIG_PATH=scheduler/configs/test/
    GLEANERIO_TENANT_FILENAME=tenant.yaml
    GLEANERIO_SOURCES_FILENAME=gleanerconfig.yaml
    GLEANERIO_DOCKER_NABU_CONFIG=nabuconfig.yaml
    ###
    # graph
    ####
    # just the base address, no namespace https://graph.geocodes-aws-dev.earthcube.org/blazegraph
    GLEANERIO_GRAPH_URL=https://graph.geocodes-aws.earthcube.org/blazegraph
    GLEANERIO_GRAPH_NAMESPACE=earthcube
    GLEANERIO_CSV_CONFIG_URL=https://docs.google.com/spreadsheets/d/e/2PACX-1vTt_45dYd5LMFK9Qm_lCg6P7YxG-ae0GZEtrHMZmNbI-y5tVDd8ZLqnEeIAa-SVTSztejfZeN6xmRZF/pub?gid=1340502269&single=true&output=csv
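As noted in the file, these values are read in the workflow code with hard-coded os.getenv() calls, so a running service must be restarted to pick up changes; a small illustration:

```python
# Illustrative only: how the code reads the .env values at startup.
import os

minio_address = os.getenv("GLEANERIO_MINIO_ADDRESS")
default_schedule = os.getenv("GLEANERIO_DEFAULT_SCHEDULE", "@weekly")
# Editing .env alone does not change a running container; restart the service
# so the os.getenv() calls see the new values.
```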
Appendix
Portainer API setup
You will need to set up Portainer to allow for an API call. To do this, see the documentation for Accessing the Portainer API.
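As a quick sanity check that the key works, you can call the Docker endpoint that GLEANERIO_DOCKER_URL points at; Portainer access tokens go in the X-API-Key header. A minimal sketch (the endpoint number and host come from the example env file and may differ in your deployment):

```python
# Sketch: list containers through the Portainer-proxied Docker API.
import os
import requests

url = os.environ["GLEANERIO_DOCKER_URL"] + "containers/json"
resp = requests.get(url, headers={"X-API-Key": os.environ["GLEANERIO_PORTAINER_APIKEY"]})
resp.raise_for_status()
print([c["Names"] for c in resp.json()])
```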
Notes
- Don't forget to set the DAGSTER_HOME directory, for example:
export DAGSTER_HOME=/home/fils/src/Projects/gleaner.io/scheduler/python/dagster
Deploy as a single stack
We deploy as two stacks for flexibility, but you can use the multiple-file (aka override) approach to deploy compose-project.yaml and compose-project-ingest.yaml as a single stack.
docker compose -p dagster --env-file $envfile -f compose_project.yaml -f compose-project-ingest.yaml up -d
Handle Multiple Organizations
There can be multiple ingest containers. These can be used for testing development deployments, and for multiple organizations.
The top-level compose-project.yaml handles Dagster.
1. Deploy a compose-project-ingest.yaml stack with a different PROJECT env variable, and MinIO and graph environment variables that push to each community's repository.
2. Configure the workflows configuration in Dagster to include those containers as workflows.
If you need to add workflows, fork the code, and add the branch to the containerize git workflow.
* Each organization can be in a container with its own code workflow.
* in the workflows directory: dagster project projectname
* If we can standardize the loading and transforming workflows as much as possible, then the graph loading workflows should be standardized. We could just define an additional container in a compose file and add it to the workflows configuration (the load_from list below):
    load_from:
    #  - python_file:
    #      relative_path: "project/eco/repositories/repository.py"
    #      location_name: project
    #      working_directory: "./project/eco/"
    #  - python_file:
    #      relative_path: "workflows/ecrr/repositories/repository.py"
    #      working_directory: "./workflows/ecrr/"
    #  module starting out with the definitions api
    #  - python_module: "workflows.tasks.tasks"
      - grpc_server:
          host: dagster-code-eco-tasks
          port: 4000
          location_name: "eco-tasks"
      - grpc_server:
          host: dagster-code-eco-ingest
          port: 4000
          location_name: "eco-ingest"
      - grpc_server:
          host: dagster-code-oih-tasks
          port: 4000
          location_name: "oih-tasks"
      - grpc_server:
          host: dagster-code-oih-ingest
          port: 4000
          location_name: "oih-ingest"
      - grpc_server:
          host: dagster-code-eco-ecrr
          port: 4000
          location_name: "eco-ecrr"
- to add a container, you need to edit the workflows.yaml in an organization's configuration
Cron Notes
A useful on-line tool: https://crontab.cronhub.io/
cron | meaning
---|---
0 3 * * * | at 3 AM each day
0 3,5 * * * | at 3 AM and 5 AM each day
0 3 * * 0 | at 3 AM on Sunday
0 3 5 * * | at 3 AM on day 5 of the month
0 3 5,19 * * | at 3 AM on days 5 and 19 of the month
0 3 1/4 * * | at 3 AM every 4 days
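In this project these cron strings end up in Dagster schedules (cf. GLEANERIO_DEFAULT_SCHEDULE and GLEANERIO_DEFAULT_SCHEDULE_TIMEZONE above); a minimal, hypothetical example:

```python
# Illustrative only: attach a cron string to a Dagster schedule.
from dagster import ScheduleDefinition, job, op

@op
def nightly_task():
    """Placeholder op."""

@job
def nightly_job():
    nightly_task()

nightly_schedule = ScheduleDefinition(
    job=nightly_job,
    cron_schedule="0 3 * * *",                 # 3 AM each day
    execution_timezone="America/Los_Angeles",  # cf. GLEANERIO_DEFAULT_SCHEDULE_TIMEZONE
)
```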
Indexing Approaches
The following approaches are used:
- Divide up the sources by sitemap and sitegraph
- Also divide by production and queue sources
The above will result in at most 4 initial sets.
We can then use the docker approach
./gleanerDocker.sh -cfg /gleaner/wd/rundir/oih_queue.yaml --source cioosatlantic
to run indexes on specific sources in these configuration files.