
Wikidata to SurrealDB

A tool for converting Wikidata dumps into a SurrealDB database, from either a bz2 or a JSON file.

The resulting SurrealDB database is ~2.6GB uncompressed or ~0.5GB compressed. By comparison, the bz2 file is ~80GB, the gzip file is ~130GB, and the uncompressed JSON file is over 1TB.

Building the database on a 7600k takes ~55 hours using ThreadedSingle. A CPU with more cores should be faster.

Getting The Data

https://www.wikidata.org/wiki/Wikidata:Data_access

From bz2 file ~80GB

Dump: Docs

Download - latest-all.json.bz2

From json file

Linked Data Interface: Docs

https://www.wikidata.org/wiki/Special:EntityData/Q60746544.json
https://www.wikidata.org/wiki/Special:EntityData/P527.json
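
The two URLs above follow the same pattern, so a single-entity JSON URL can be derived from an ID. A minimal sketch, assuming only that pattern; `entity_data_url` is a hypothetical helper and not part of this tool:

```rust
/// Builds the Linked Data Interface URL for one Wikidata ID,
/// following the pattern of the example URLs in this README.
/// Hypothetical helper for illustration; it does not validate the ID.
fn entity_data_url(id: &str) -> String {
    format!(
        "https://www.wikidata.org/wiki/Special:EntityData/{}.json",
        id
    )
}
```

Calling `entity_data_url("P527")` yields the second URL listed above. The downloaded file can then be placed in the data folder described under Install.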

Install

Copy docker-compose-surrealdb.yml

Create a data folder next to docker-compose.yml and .env, place the data inside, and set the data type (WIKIDATA_FILE_FORMAT) in .env

├── data
│   ├── Entity.json
│   ├── latest-all.json.bz2
│   ├── filter.surql
│   ├── surrealdb
│   └── temp
├── Makefile
├── docker-compose.yml
└── .env

Then run:

make up-surrealdb

Exit with:

make down-surrealdb

View Progress

make view

Example .env

DB_USER=root
DB_PASSWORD=root
WIKIDATA_LANG=en
WIKIDATA_FILE_FORMAT=bz2
WIKIDATA_FILE_NAME=data/latest-all.json.bz2
# If not using the Docker setup for Wikidata to SurrealDB, use 0.0.0.0:8000
WIKIDATA_DB_PORT=surrealdb:8000
CREATE_VERSION=Bulk
#FILTER_PATH=data/filter.surql
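
To make the shape of this file concrete, here is a rough sketch of how such `KEY=VALUE` lines could be read into typed settings. The `Settings` struct, `parse_env`, and `settings_from` are hypothetical names, and treating `FILTER_PATH` as optional is an assumption based on it being commented out above. The tool's own loading code may differ.

```rust
use std::collections::HashMap;

/// Hypothetical settings struct mirroring the example `.env` keys.
#[derive(Debug, PartialEq)]
struct Settings {
    lang: String,
    file_format: String,
    file_name: String,
    db_port: String,
    /// Assumed optional, since it is commented out in the example.
    filter_path: Option<String>,
}

/// Parses `KEY=VALUE` lines, skipping blank lines and `#` comments.
fn parse_env(text: &str) -> HashMap<String, String> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .filter_map(|line| line.split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

/// Builds `Settings`, reporting the first missing required key.
fn settings_from(vars: &HashMap<String, String>) -> Result<Settings, String> {
    let get = |key: &str| {
        vars.get(key)
            .cloned()
            .ok_or_else(|| format!("missing {}", key))
    };
    Ok(Settings {
        lang: get("WIKIDATA_LANG")?,
        file_format: get("WIKIDATA_FILE_FORMAT")?,
        file_name: get("WIKIDATA_FILE_NAME")?,
        db_port: get("WIKIDATA_DB_PORT")?,
        filter_path: vars.get("FILTER_PATH").cloned(),
    })
}
```

Failing on a missing key, rather than silently substituting a default, keeps a misconfigured `.env` from starting a multi-hour import against the wrong file or host.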

The CREATE_VERSION env string must match one of the variants of the CreateVersion enum:

#[derive(Default)] // needed for the #[default] attribute below
pub enum CreateVersion {
    #[default]
    Bulk,
    /// Requires a `filter.surql` file in the data directory
    BulkFilter,
}
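
One way the env string could be mapped onto this enum is a `FromStr` implementation. This is a sketch under assumptions: the exact, case-sensitive matching and the fallback to the default variant when the variable is unset are illustrative choices, and `create_version_from` is a hypothetical helper. The extra derives are added only so the example can be compared and printed.

```rust
use std::str::FromStr;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub enum CreateVersion {
    #[default]
    Bulk,
    /// Requires a `filter.surql` file in the data directory
    BulkFilter,
}

impl FromStr for CreateVersion {
    type Err = String;

    /// Accepts exactly the variant names (assumed case-sensitive).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "Bulk" => Ok(Self::Bulk),
            "BulkFilter" => Ok(Self::BulkFilter),
            other => Err(format!("unknown CREATE_VERSION: {}", other)),
        }
    }
}

/// Hypothetical helper: parses the value if present, otherwise
/// falls back to the `#[default]` variant.
fn create_version_from(value: Option<&str>) -> Result<CreateVersion, String> {
    match value {
        Some(raw) => raw.trim().parse(),
        None => Ok(CreateVersion::default()),
    }
}
```

In a real binary the argument would typically come from `std::env::var("CREATE_VERSION").ok().as_deref()`.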

filter.surql examples

Dev Install

How to Query

namespace = wikidata
database = wikidata

See Useful queries.md

Table Schema

SurrealDB Thing

pub struct Thing {
    pub table: String,
    pub id: Id, // i64
}

Tables: Entity, Property, Lexeme

pub struct EntityMini {
    pub id: Option<Thing>,
    pub label: String,
    // Claims Table
    pub claims: Thing,
    pub description: String,
}
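
Since record IDs are numeric and the tables are Entity, Property, and Lexeme, a Wikidata ID such as `Q60746544` presumably splits into a table name and an `i64`. The sketch below illustrates that idea. The prefix-to-table mapping (`Q` to Entity, `P` to Property, `L` to Lexeme) is an assumption, and `RecordRef` and `record_ref` are hypothetical names standing in for SurrealDB's `Thing`.

```rust
/// Simplified stand-in for SurrealDB's `Thing`, with the id narrowed to i64.
#[derive(Debug, PartialEq)]
struct RecordRef {
    table: &'static str,
    id: i64,
}

/// Splits a Wikidata ID into a table name and numeric id.
/// Returns `None` for unknown prefixes or non-numeric suffixes.
fn record_ref(wikidata_id: &str) -> Option<RecordRef> {
    let mut chars = wikidata_id.chars();
    // Assumed mapping from ID prefix to table name.
    let table = match chars.next()? {
        'Q' => "Entity",
        'P' => "Property",
        'L' => "Lexeme",
        _ => return None,
    };
    let digits = chars.as_str();
    // Reject empty suffixes and signs such as "+5" that `parse` would accept.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let id = digits.parse().ok()?;
    Some(RecordRef { table, id })
}
```

Under this mapping, `record_ref("P527")` gives table `Property` with id `527`, while malformed inputs such as `"Q"` or `"X1"` give `None`.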

Table: Claims

pub struct Claim {
    pub id: Thing,
    pub value: ClaimData,
}

ClaimData

Docs

pub enum ClaimData {
    // Reference to a record in the Entity, Property, or Lexeme tables
    Thing(Thing),
    ClaimValueData(ClaimValueData),
}
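
Because `ClaimData` is either a record reference or a literal value, consumers will usually branch on the variant. The following sketch filters a list of claim values down to the ones that link to other records. The `Thing` and `ClaimValueData` types here are simplified placeholders (the real `ClaimValueData` variants are not shown in this README), and `linked_records` is a hypothetical helper.

```rust
/// Simplified stand-in for SurrealDB's `Thing`.
#[derive(Debug, Clone, PartialEq)]
struct Thing {
    table: String,
    id: i64,
}

/// Placeholder only: the real type has its own variants.
#[derive(Debug, Clone, PartialEq)]
struct ClaimValueData(String);

#[derive(Debug, Clone, PartialEq)]
enum ClaimData {
    // Reference to a record in the Entity, Property, or Lexeme tables
    Thing(Thing),
    ClaimValueData(ClaimValueData),
}

/// Keeps only the claim values that point at other records.
fn linked_records(claims: &[ClaimData]) -> Vec<&Thing> {
    claims
        .iter()
        .filter_map(|claim| match claim {
            ClaimData::Thing(thing) => Some(thing),
            ClaimData::ClaimValueData(_) => None,
        })
        .collect()
}
```

An exhaustive `match` means that if a new variant is ever added to the enum, code like this fails to compile until the new case is handled.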

Similar Projects

License

All code in this repository is dual-licensed under either LICENSE-MIT or LICENSE-Apache, at your option, so you can select the license you prefer. Why dual license.