Wikidata to SurrealDB
A tool for converting Wikidata dumps to a SurrealDB database, from either a bz2 or a json file.
The SurrealDB database is ~2.6GB uncompressed or 0.5GB compressed, while the bz2 file is ~80GB, the gzip file is ~130GB, and the uncompressed json file is over 1TB.
Building the database on a 7600k takes ~55 hours using ThreadedSingle; a CPU with more cores should be faster.
Getting The Data
https://www.wikidata.org/wiki/Wikidata:Data_access
From bz2 file ~80GB
Dump: Docs
Download - latest-all.json.bz2
From json file
Linked Data Interface: Docs
https://www.wikidata.org/wiki/Special:EntityData/Q60746544.json
https://www.wikidata.org/wiki/Special:EntityData/P527.json
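Both URLs above follow the same pattern, so a single-entity json file can be located from its ID alone. A minimal sketch, assuming a hypothetical helper named `entity_data_url` (not part of this repository) and that the ID is already well-formed:

```rust
/// Builds the Linked Data Interface URL for one Wikidata ID,
/// e.g. "Q60746544" or "P527". Hypothetical helper for illustration;
/// it does not validate the ID or download anything.
pub fn entity_data_url(id: &str) -> String {
    format!("https://www.wikidata.org/wiki/Special:EntityData/{id}.json")
}
```

The resulting URL can be fetched with any HTTP client and the response saved into the data folder.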
Install
Copy docker-compose-surrealdb.yml
Create a data folder next to docker-compose.yml and .env, place the data inside it, and set the data type in .env:
├── data
│ ├── Entity.json
│ ├── latest-all.json.bz2
│ ├── filter.surql
│ ├── surrealdb
│ └── temp
├── justfile
├── docker-compose.yml
└── .env
Then run:
just up surrealdb
Exit with:
just down surrealdb
View Progress
just view
Example .env
DB_USER=root
DB_PASSWORD=root
WIKIDATA_LANG=en
WIKIDATA_FILE_FORMAT=bz2
WIKIDATA_FILE_NAME=data/latest-all.json.bz2
# If not using docker file for Wikidata to SurrealDB, use 0.0.0.0:8000
WIKIDATA_DB_PORT=surrealdb:8000
CREATE_VERSION=Bulk
#FILTER_PATH=data/filter.surql
The CREATE_VERSION value in .env must match a variant of the CreateVersion enum:
pub enum CreateVersion {
    #[default]
    Bulk,
    /// must create a `filter.surql` file in the data directory
    BulkFilter,
}
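How the tool maps the `CREATE_VERSION` string onto this enum is not shown here. One plausible approach is a `FromStr` implementation, sketched below. The matching rules (exact, case-sensitive variant names) and the helper `create_version_from` are assumptions for illustration, not the repository's actual code:

```rust
use std::str::FromStr;

/// Mirrors the enum above, with extra derives so values can be compared.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub enum CreateVersion {
    #[default]
    Bulk,
    /// Requires a `filter.surql` file in the data directory.
    BulkFilter,
}

impl FromStr for CreateVersion {
    type Err = String;

    // Assumption: the env string must equal a variant name exactly.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim() {
            "Bulk" => Ok(Self::Bulk),
            "BulkFilter" => Ok(Self::BulkFilter),
            other => Err(format!("unknown CREATE_VERSION: {other:?}")),
        }
    }
}

/// Hypothetical helper: an unset variable falls back to the default variant.
pub fn create_version_from(value: Option<&str>) -> Result<CreateVersion, String> {
    match value {
        Some(v) => v.parse(),
        None => Ok(CreateVersion::default()),
    }
}
```

In a binary this could be called as `create_version_from(std::env::var("CREATE_VERSION").ok().as_deref())`, so a typo in .env surfaces as an error rather than silently selecting a mode.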
filter.surql examples
Dev Install
How to Query
namespace = wikidata
database = wikidata
See Useful queries.md
Table Schema
SurrealDB Thing
pub struct Thing {
    pub table: String,
    pub id: Id, // i64
}
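Because the record ID is an `i64`, a Wikidata ID such as `Q60746544` has to be split into a table name and a number. The sketch below shows one way to do that. It assumes the `Q`, `P`, and `L` prefixes map to the Entity, Property, and Lexeme tables respectively. The function name `split_wikidata_id` is hypothetical and the repository may handle this differently:

```rust
/// Splits a Wikidata ID into (table name, numeric ID).
/// Assumed mapping: Q -> Entity, P -> Property, L -> Lexeme.
/// Returns None for unknown prefixes or non-numeric suffixes.
pub fn split_wikidata_id(raw: &str) -> Option<(&'static str, i64)> {
    let mut chars = raw.chars();
    let table = match chars.next()? {
        'Q' => "Entity",
        'P' => "Property",
        'L' => "Lexeme",
        _ => return None,
    };
    // Everything after the prefix must be plain ASCII digits;
    // this also rejects signs such as "+5" that `parse` would accept.
    let digits = chars.as_str();
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse::<i64>().ok().map(|n| (table, n))
}
```

The returned pair corresponds to the `table` and `id` fields of `Thing`.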
Tables: Entity, Property, Lexeme
pub struct EntityMini {
    pub id: Option<Thing>,
    pub label: String,
    // Claims Table
    pub claims: Thing,
    pub description: String,
}
Table: Claims
pub struct Claim {
    pub id: Thing,
    pub value: ClaimData,
}
ClaimData
Docs
pub enum ClaimData {
    // Entity, Property, Lexeme Tables
    Thing(Thing),
    ClaimValueData(ClaimValueData),
}
Similar Projects
License
All code in this repository is dual-licensed under either LICENSE-MIT or LICENSE-Apache, at your option. This means you can select the license you prefer. Why dual license.