Onboarding a Datasource/Repository
If we have an issue with a repository, we can test it independently. If we have a new datasource or community, we can test that independently too.
We can do this because:
- S3 paths are repository-based, so we can load a repository to its own S3 bucket
- Blazegraph: we can create two namespaces in our OpenStack Blazegraph stores
- We can set up 'tenant' instances of the UI that connect to these services (the S3 bucket and the project namespaces)
- We have some source reports to help evaluate the data loading
This assumes we have some basic information about the data source (namely its sitemap) and a name for the repository.
Note
This is a high-level overview that assumes you have loaded data and do not need any deep details. Over time, put the troubleshooting and gotchas below the steps.
Note
If the source is large, run in a screen. In fact, it is suggested to always run in a screen.
Please put any issues/notes in the production/repos Google Docs.
Some steps.
- Grab some URLs from the sitemap and evaluate them in validator.schema.org
- Run check_sitemap to see if the URLs are good
- Set up datastores
    - any S3
    - independent project and project_summary namespaces
- Create a gleaner config for the repo
- If the source is large, use screen, e.g.
screen -S gleaner
glcon gleaner batch
Summon to an S3 location. Repos are independent at this point.
    - Did we suggest running in a screen?
- Evaluate the summon. Look at the JSON-LD. Does it seem like the documents got loaded?
    - Thought: do we need a tool to pull a specific URL from S3? We could filter listSummonedUrls; we do have getOringalUrl.
- Run missing stats. We may need an option to check just the sitemap > summon portion.
- If these look good, then run glcon nabu prefix
    - Did we suggest running in a screen?
    - There needs to be a note about using glcon nabu release --cfgName CONFIG, and how to upload the quads
- Run graph_stats and missing...
    - The graph stats report needs to be updated to include [all/repo]_count_types_top_level.sparql
- Check reports
    - Feel free to run repo_urn_w_types_toplevel.sparql
- Run summarize_* to populate the summary
- Create a facets configuration for the project, upload it to Portainer, and create a stack of tenant containers to run against the project and project_summary namespaces
- Via the UI, run queries to see that it works
    - Hmm, add a simple query tester to scripts
- Review with the team
- Review with the datasource
- Add to production sources
Evaluating with validator
What to look for
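For the "grab some URLs from the sitemap" step, a minimal sketch of pulling a small, reproducible sample of URLs to paste into validator.schema.org. It assumes a standard sitemaps.org `urlset` document. The names `sitemap_urls` and `sample_urls` and the example.org URLs are hypothetical illustrations, not part of the project's tooling.

```python
import random
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


def sample_urls(urls, k=5, seed=0):
    """Pick up to k URLs, reproducibly, for manual validation."""
    rng = random.Random(seed)
    return rng.sample(urls, min(k, len(urls)))


# Hypothetical sitemap content, for illustration only.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
  <url><loc>https://example.org/dataset/2</loc></url>
  <url><loc>https://example.org/dataset/3</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in sample_urls(sitemap_urls(EXAMPLE), k=2):
        print(url)
```

A fixed seed keeps the sample stable, so the same URLs can be re-checked after the datasource fixes its markup.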
Evaluating summon
What to look for
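One quick way to answer "do they seem like they got loaded?" is to confirm that each summoned document parses as JSON and carries at least one top-level `@type`. The sketch below assumes the JSON-LD has already been fetched from S3 into a dict of object name to text. Fetching is not shown, and `check_summoned` and `top_level_types` are hypothetical helper names.

```python
import json


def top_level_types(doc):
    """Collect @type values from a JSON-LD document's top-level nodes."""
    if isinstance(doc, list):
        nodes = doc
    elif isinstance(doc, dict) and isinstance(doc.get("@graph"), list):
        nodes = doc["@graph"]
    else:
        nodes = [doc]
    types = []
    for node in nodes:
        if not isinstance(node, dict):
            continue
        value = node.get("@type", [])
        types.extend(value if isinstance(value, list) else [value])
    return types


def check_summoned(docs):
    """Map each object name to 'ok', 'no @type', or 'invalid json'."""
    report = {}
    for name, text in docs.items():
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            report[name] = "invalid json"
            continue
        report[name] = "ok" if top_level_types(parsed) else "no @type"
    return report
```

This is only a shape check. It does not expand contexts or validate against schema.org, so a document marked "ok" can still fail in the validator.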
Evaluating graph and missing reports
What to look for
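The sitemap > summon portion of a missing report comes down to a set comparison between the URLs the sitemap lists and the URLs that were summoned. A sketch under the assumption that both URL lists are already in hand (how the summoned list is obtained, for instance from listSummonedUrls, is not shown). `missing_report` and its field names are hypothetical and do not reflect the project's actual report format.

```python
def missing_report(expected_urls, summoned_urls):
    """Compare sitemap URLs against summoned URLs."""
    expected = set(expected_urls)
    summoned = set(summoned_urls)
    matched = expected & summoned
    return {
        "sitemap_count": len(expected),
        "summoned_count": len(summoned),
        # In the sitemap but never summoned.
        "missing": sorted(expected - summoned),
        # Summoned but not listed in the sitemap.
        "unexpected": sorted(summoned - expected),
        # Fraction of sitemap URLs that were summoned.
        "coverage": len(matched) / len(expected) if expected else 0.0,
    }


if __name__ == "__main__":
    report = missing_report(
        ["https://example.org/1", "https://example.org/2"],
        ["https://example.org/1"],
    )
    print(report["missing"], report["coverage"])
```

A low coverage value or a long `missing` list is a hint to look at the sitemap and the gleaner config before moving on to loading the graph.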
How to test the UI
keywords: data, repository