Parent category: Database technology

Document based databases

How should we use document based databases?


MarkLogic, MongoDB, and Couchbase are document databases.

They're often very fast.

Unfortunately, compared to relational databases, they often do not support efficient joins.
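To make that weakness concrete, here is a minimal sketch of the join an application must do by hand when the document store has no join operator. The "users" and "orders" collections are hypothetical, and plain dicts stand in for the database:

```python
# Hypothetical collections: each order only stores a reference to a user,
# so joining means one lookup per order (the classic N+1 query pattern
# when done against a real document store).
users = {
    "u1": {"name": "Samuel Squire"},
    "u2": {"name": "Mindey"},
}
orders = [
    {"user_id": "u1", "item": "keyboard"},
    {"user_id": "u1", "item": "monitor"},
    {"user_id": "u2", "item": "book"},
]

# The application-side join: resolve each order's user reference manually.
joined = [
    {"name": users[o["user_id"]]["name"], "item": o["item"]}
    for o in orders
]
print(joined)
```

A relational database would do this server-side with a single join; here every order costs a separate lookup.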




How to build a document based database that supports joins?

A common weakness of document databases is that they often do not support efficient joins. They represent document records as opaque blobs.

I created the attached project to talk about how I plan to implement document based storage using key-value storage as the backend. Each field of the document is stored under a separate key, and the keys are arranged so that range scans are efficient, which in turn lets hash joins run efficiently.

    : Bassxn2
    :  -- 
    :  -- 




The key word is "efficient". Efficiency is bounded by computational complexity, so I assume you are looking for new join algorithms over unstructured data.

First, the problem is already solved in SQL databases, right? Why not take a look at the implementation, and take it from there?

Let's say we have the raw data as records of JSON (or dictionaries, hashmaps). What you're concerned about then is efficient querying, which is the subject of indexing (query optimization, or query algorithms). We routinely index SQL databases into ElasticSearch, because SQL databases are not good enough or not flexible enough at the kinds of text search users care about: we use another data system that is good at it, and keep a copy of the data there. Not very space-saving, but it works. We could do the same with NoSQL: if you need join-like queries, just "index" the data into a SQL database via a specialized process that interprets and migrates the SQL database on the fly, working in concert with the NoSQL store, always looking for new fields and creating those fields in the complementary SQL database. Sure, using many databases at once is not an elegant solution, so I agree that we need to improve document based databases. After all, schemas are not non-existent: every record implies a schema of some sort, and when sufficiently many records share certain fields, that may justify creating a new SQL field or foreign key. Think of it like a brain that realizes new "laws of physics" once it sees sufficiently many examples of a specific type...
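The field-discovery process described above could be sketched roughly like this. It is a toy illustration only, not an implementation: the `people` table, the sample records, and the rule "any new scalar field becomes a column" are all assumptions made for the example, using Python's built-in sqlite3 as the complementary SQL database:

```python
import sqlite3

# Hypothetical NoSQL records to be "indexed" into SQL. The second record
# lacks "age", so that column simply stays NULL for it.
records = [
    {"_id": "1", "name": "Samuel Squire", "age": 33},
    {"_id": "2", "name": "Mindey"},
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (_id TEXT PRIMARY KEY)")

known = {"_id"}
for rec in records:
    # Discovery step: any scalar field we have not seen before
    # becomes a new SQL column on the fly.
    for field, value in rec.items():
        if field not in known and isinstance(value, (str, int, float)):
            db.execute(f'ALTER TABLE people ADD COLUMN "{field}"')
            known.add(field)
    # Load step: insert only the fields this record actually has.
    cols = [f for f in rec if f in known]
    db.execute(
        f'INSERT INTO people ({", ".join(cols)}) '
        f'VALUES ({", ".join("?" * len(cols))})',
        [rec[c] for c in cols],
    )

rows = db.execute("SELECT name, age FROM people ORDER BY _id").fetchall()
print(rows)
```

A real sync process would also need to watch for type conflicts, nested objects (the foreign-key case mentioned above), and deletions; this sketch only shows the column-discovery idea.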


I've designed a keyspace for JSON that is fast to decode back into JSON and fast to scan with a RocksDB key-value range scan.

This lets us do a regular hash join as a relational database does.
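As a rough illustration of that hash join, here is a sketch in which a sorted in-memory list stands in for the RocksDB keyspace, and the `people/` and `companies/` key layout is illustrative rather than the actual encoding below:

```python
import bisect

# A sorted list of (key, value) pairs standing in for a RocksDB keyspace.
# Because keys sort by prefix, all rows of one "table" are contiguous.
store = sorted([
    ("people/1/name", "Samuel Squire"),
    ("people/1/company_id", "c1"),
    ("people/2/name", "Mindey"),
    ("people/2/company_id", "c2"),
    ("companies/c1/employeeCount", "2500"),
    ("companies/c2/employeeCount", "3"),
])

def range_scan(prefix):
    """Yield (key, value) pairs whose key starts with prefix, like a
    key-value store iterator after seeking to the prefix."""
    keys = [k for k, _ in store]
    i = bisect.bisect_left(keys, prefix)
    while i < len(store) and store[i][0].startswith(prefix):
        yield store[i]
        i += 1

# Build phase: hash the smaller side (companies) on its join key.
counts = {k.split("/")[1]: v for k, v in range_scan("companies/")}

# Probe phase: reassemble people rows from their keys, then probe.
people = {}
for k, v in range_scan("people/"):
    _, pid, field = k.split("/")
    people.setdefault(pid, {})[field] = v

joined = [(p["name"], counts[p["company_id"]]) for p in people.values()]
print(joined)
```

The point is that because each document's keys are clustered, one cheap range scan per side feeds the standard build/probe hash join a relational engine would use.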

This JSON document:

    {"_id": "1",
     "name": "Samuel Squire",
     "job": {"currentJob": {"company": {"employeeCount": 2500}}},
     "hobbies": [
       {"name": "God"}, {"name": "databases"}, {"name": "multicomputer systems"}
     ]
    }

is turned into at least the following keyvalue objects:

0.0 = "Samuel Squire" = "2500"

0.0 = "Samuel Squire" = "God" = "databases" = "multicomputer systems"

Essentially, the keys form a flat structure of the document.

"type people": "object",

"type people.*": "list",

"type people.*.0": "string",

"type people.*.1": "list",

"type people..1..0": "string",

"type people..1.": "object",

"type people.*.2": "object",

"type people.*.2.0": "object",

"type people.*.2.0.0": "object",

"type people.*.": "number",

"field people.*": "LIST",

"field people..1.": "LIST",

"field people.*.0": "name",

"field people.*.1": "hobbies",

"field people..1..0": "name",

"field people.*.2": "job",

"field people.*.2.0": "currentJob",

"field people.*.2.0.0": "company",

"field people.*.": "employeeCount",

"field people": "people",

"field people.*": "LIST",

"field people.*.3": "words",

"field people..3.": "LIST",

"field people..3..*":"LIST",

"field people..3...": "LIST",

"type people.*.3": "list",

"type people..3.": "list",

"type people..3..*": "list",

"type people..3...": "list",

"type people..3....*": "number"

    : Mindey
    :  -- 
    :  -- 



Your idea of synchronizing a SQL database with a document store is similar to my thought of synchronizing a SQL database with DynamoDB, which is a fast key-value store.

I want the best of NoSQL performance with the power of SQL joins.

    : Mindey
    :  -- 
    :  --