JP6416194B2

JP6416194B2 - Scalable analytic platform for semi-structured data

Info

Publication number: JP6416194B2
Application number: JP2016503110A
Authority: JP
Inventors: Tsirogiannis, Dimitrios; Binkert, Nathan A.; Harizopoulos, Stavros; Shah, Mehul A.; Sowell, Benjamin A.; Kaplan, Bryan D.; Meyer, Kevin R.
Original assignee: Amazon Technologies, Inc.
Priority date: 2013-03-15
Filing date: 2014-03-14
Publication date: 2018-10-31
Anticipated expiration: 2034-03-14
Also published as: WO2014144889A2; CA3078018A1; CA2906816A1; US20140279834A1; US10983967B2; US20170206256A1; CN105122243B; JP2016519810A; US9613068B2; US20140279838A1; EP2973051A4; CA3078018C; CN105122243A; EP2973051A2; CA2906816C; US10275475B2; WO2014144889A3

Description

Cross-reference to related applications

This disclosure claims priority to U.S. Patent Application No. 14/213,941, filed March 14, 2014, and claims the benefit of U.S. Provisional Patent Application No. 61/800,432, filed March 15, 2013. The entire contents of the above applications are incorporated herein by reference.

The present disclosure relates to a scalable interactive database platform, and more particularly to a scalable interactive database platform for semi-structured data that incorporates storage and computation.

The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.

Conventional database systems feature a query execution engine that is tightly integrated with the underlying storage back end. The underlying storage back end typically consists of block-addressable persistent storage devices with no compute capability. These devices (hard disk drives and/or solid-state drives) are characterized by (a) access times that differ significantly depending on whether data is accessed sequentially or randomly, (b) access units with a fixed minimum size, set at the granularity of a block, and (c) access times that are significantly (orders of magnitude) slower than main memory. These characteristics, together with the assumption that the storage back end has no non-trivial compute capability, have had an important impact on the design of database systems, from storage management to query execution to query optimization.

Databases originally served as operational stores that managed the day-to-day activities of businesses. As database technology improved in both performance and cost, businesses found it necessary to keep an ever larger amount of operational history and business state for later analysis. Such analyses help businesses gain insight into their processes and optimize them, providing a competitive advantage and increasing profits.

Data warehousing arose out of this need. Business data is often well structured and fits easily into relational tables. Data warehouses are essentially scalable relational database systems that offer structured query language (SQL) for offline analysis of this business data, and they are optimized for read-mostly workloads. For example, data warehouses include traditional systems such as Teradata and newer vendors such as Vertica, Greenplum, and Aster Data. They provide a SQL interface, indexes, and fast columnar access.

Typically, a data warehouse is loaded periodically, for example nightly or weekly, with data ingested from various sources and operational systems. The process of cleaning, curating, and unifying this data into a single schema and loading it into the warehouse is known as extract, transform, load (ETL). As the variety of sources and data increases, the complexity of the ETL process also increases. Successfully implementing ETL, including defining appropriate schemas and matching input data to the predetermined schemas, can take specialists weeks to months, and changes can be hard or impossible to make. Many tools, such as Abinitio, Informatica, and Pentaho, are on the market to assist with the ETL process. Nevertheless, the ETL process generally remains cumbersome, brittle, and expensive.

The data analytics market is crowded with business intelligence and visualization tools that make it easy for business users to perform ad hoc, iterative analyses of warehouse data. Business intelligence tools build multidimensional aggregates of warehouse data and allow users to navigate the data and view various slices and projections of it. For example, a business user might want to see total monthly sales by product category, region, and store. The user might then want to drill down to weekly sales for specific categories, or roll up to see sales for the entire country. Multidimensional aggregates may also be referred to as online analytical processing (OLAP) cubes. Many business intelligence (BI) tools, such as Business Objects and Cognos, enable such analyses and support a language called Multidimensional Expressions (MDX) for querying cubes. There are also many visualization tools, such as MicroStrategy, Tableau, and Spotfire, that allow business users to navigate these cubes and data warehouses intuitively.

More recently, the type of data that businesses want to analyze has changed. As traditional brick-and-mortar businesses go online and new online businesses form, they need to analyze the kinds of data that leading companies such as Google and Yahoo are inundated with. These include web pages, logs of page views, click streams, RSS (Rich Site Summary) feeds, application logs, application server logs, system logs, transaction logs, sensor data, social network feeds, news feeds, and blog posts.

These semi-structured data do not fit well into traditional warehouses. They have some inherent structure, but that structure may be inconsistent. The structure can change quickly over time and may vary from source to source. Because the data are not naturally tabular, the analyses that users want to run over them, such as clustering, classification, and prediction, are not easily expressed in SQL. Existing tools for making effective use of these data are cumbersome and insufficient.

As a result, a new highly scalable storage and analysis platform arose: Hadoop, which was inspired by the technologies implemented at Google for managing web crawls and searches. At its core, Hadoop offers the Hadoop Distributed File System (HDFS), a clustered file system for storing data reliably, and MapReduce, a rudimentary parallel analysis engine that supports more complex analyses. Starting from these pieces, the Hadoop ecosystem has grown to include HBase, an indexed operational store, and Pig and Hive, new query interfaces that rely on MapReduce.

Hive is an Apache project that adds a query layer on top of Hadoop, without any of the optimizations found in traditional warehouses for query optimization, caching, and indexing. Instead, Hive simply turns queries in a SQL-like language (called Hive-QL) into MapReduce jobs to be run against the Hadoop cluster. For traditional business users, Hive has three main problems. Hive does not support standard SQL, and it does not have a dynamic schema. Further, Hive is not fast enough for interactive queries, because each Hive query requires a MapReduce job that re-parses all of the source data and often requires multiple passes through the source data.

Impala is a real-time engine for Hive-QL queries on Cloudera's Hadoop implementation. It provides analysis over Hive's sequence files and may eventually support nested models. However, Impala does not have a dynamic schema: the user must provide a schema upfront for the data to be queried.

Pig is another Apache project and offers a schema-free scripting language for processing log files in Hadoop. Like Hive, Pig translates everything into MapReduce jobs. Likewise, it does not leverage any indexes and is not fast enough for interactive use.

Jaql is a schema-free declarative language (in contrast to declarative languages such as SQL) for analyzing JavaScript Object Notation (JSON) logs. Like Pig, Jaql compiles into MapReduce programs on Hadoop and therefore shares many of the same drawbacks, including speeds that are not suited to interactive use.

Hadoop itself is catching on fairly quickly and is readily available in the cloud. Amazon offers Elastic MapReduce, which may be effectively equivalent to Hadoop's MapReduce implementation running in the cloud. It processes data stored in Amazon's cloud-based S3 (Simple Storage Service) and outputs results to S3.

The advantages of the Hadoop ecosystem are threefold. First, the system scales to extreme sizes and can store any data type. Second, it is extremely low cost compared to traditional warehouses (roughly one-twentieth of the cost). Third, it is open source, which avoids lock-in to a single vendor. Users want the ability to pick the right tool for the right job and to avoid moving data between systems to get the job done. Although Hadoop is flexible, using it requires specially skilled administrators and programmers with deep knowledge, who are hard to find. Moreover, Hadoop is too slow to be interactive. Even the simplest queries take minutes to hours to execute.

Dremel is a tool developed internally at Google that provides SQL-based analysis queries over nested-relational or semi-structured data. The original version handled data in ProtoBuf format. Dremel requires the user to define the schema upfront for all records. BigQuery is a cloud-based commercialization of Dremel and has been extended to handle CSV and JSON formats. Drill is an open-source version of Dremel.

Asterix is a system for managing and analyzing semi-structured data using an abstract data model (ADM), which is a generalization of JSON, and an annotation query language (AQL). Asterix does not support standard SQL, nor does it have the fast access afforded by the present disclosure.

A data transformation system includes a schema inference module and an export module. The schema inference module is configured to dynamically create a cumulative schema for objects retrieved from a first data source. Each of the retrieved objects includes (i) data and (ii) metadata describing the data. Dynamically creating the cumulative schema includes, for each object of the retrieved objects, (i) inferring a schema from the object and (ii) selectively updating the cumulative schema to describe the object according to the inferred schema. The export module is configured to output the data of the retrieved objects to a data destination system according to the cumulative schema.
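
The infer-then-update loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema representation (a dictionary from type name to nested schema) and the function names `infer_schema` and `merge_schemas` are hypothetical and are not taken from the disclosure.

```python
from typing import Any, Dict

# Hypothetical representation: type name -> nested schema (None for scalars).
Schema = Dict[str, Any]


def infer_schema(value: Any) -> Schema:
    """Infer a schema for a single JSON-like value."""
    if isinstance(value, dict):
        return {"object": {key: infer_schema(item) for key, item in value.items()}}
    if isinstance(value, list):
        element: Schema = {}
        for item in value:
            element = merge_schemas(element, infer_schema(item))
        return {"array": element}
    if isinstance(value, bool):  # bool must be tested before int
        return {"boolean": None}
    if isinstance(value, (int, float)):
        return {"number": None}
    if value is None:
        return {"null": None}
    return {"string": None}


def merge_schemas(cumulative: Schema, inferred: Schema) -> Schema:
    """Return a schema that describes everything either input describes."""
    merged = dict(cumulative)
    for type_name, sub in inferred.items():
        if type_name not in merged:
            merged[type_name] = sub
        elif type_name == "object":
            fields = dict(merged["object"])
            for name, field_schema in sub.items():
                fields[name] = merge_schemas(fields.get(name, {}), field_schema)
            merged["object"] = fields
        elif type_name == "array":
            merged["array"] = merge_schemas(merged["array"], sub)
    return merged


# The cumulative schema grows only when an object adds something new.
cumulative: Schema = {}
for obj in [{"id": 1, "name": "a"}, {"id": "x2", "tags": ["t"]}]:
    cumulative = merge_schemas(cumulative, infer_schema(obj))
```

In this sketch the field `id` ends up with both a number and a string type, which is one simple way to keep a single cumulative schema valid for objects whose structure disagrees.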

In other features, the data destination system includes a data warehouse. In other features, the data warehouse stores relational data. In other features, the export module is configured to convert the cumulative schema into a relational schema and to output the data of the retrieved objects to the data warehouse according to the relational schema. In other features, the export module is configured to generate commands for the data warehouse to update the schema of the data warehouse to reflect changes made to the relational schema.
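
As a rough illustration of converting a nested schema into relational columns and generating schema-update commands, the sketch below flattens nested fields into dotted column names and emits `ALTER TABLE ... ADD COLUMN` statements for columns that are new. The type mapping, the dotted naming convention, and the function names are assumptions made for this example, not details given in the disclosure.

```python
# Hypothetical mapping from inferred scalar types to SQL column types.
SQL_TYPES = {"number": "DOUBLE PRECISION", "string": "VARCHAR", "boolean": "BOOLEAN"}


def flatten(schema, prefix=""):
    """Map a nested {field: type-or-dict} schema to {column_name: sql_type}."""
    columns = {}
    for name, node in schema.items():
        path = f"{prefix}{name}"
        if isinstance(node, dict):
            # Nested object: recurse, joining the path with dots.
            columns.update(flatten(node, prefix=path + "."))
        else:
            columns[path] = SQL_TYPES[node]
    return columns


def alter_commands(table, old_columns, new_columns):
    """Emit one ADD COLUMN command per column missing from the old schema."""
    return [
        f'ALTER TABLE {table} ADD COLUMN "{column}" {sql_type};'
        for column, sql_type in new_columns.items()
        if column not in old_columns
    ]


old = flatten({"id": "number", "user": {"name": "string"}})
new = flatten({"id": "number", "user": {"name": "string", "age": "number"}})
commands = alter_commands("events", old, new)
```

Only additive changes are handled here; type changes or dropped columns would need a policy of their own.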

In other features, the export module is configured to create at least one intermediate file from the data of the retrieved objects according to the relational schema. In other features, the at least one intermediate file has a predefined data warehouse format. In other features, the export module is configured to bulk load the at least one intermediate file into the data warehouse. In other features, an index store is configured to store the data from the retrieved objects in columnar form. In other features, the export module is configured to generate row-based data from the data stored in the index store. In other features, the schema inference module is configured to create a time index in the index store that maps time values to identifiers of the retrieved objects.
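
Generating row-based data from columnar storage can be pictured as follows. The layout assumed here (one dictionary per column, keyed by object identifier) and the name `rows_from_columns` are illustrative only; the disclosure does not specify this structure at this point.

```python
def rows_from_columns(columns):
    """Reassemble rows from a columnar layout of {column: {object_id: value}}."""
    object_ids = sorted({oid for values in columns.values() for oid in values})
    names = sorted(columns)
    for oid in object_ids:
        # A column with no entry for this object yields None (a SQL NULL on export).
        yield {"_id": oid, **{name: columns[name].get(oid) for name in names}}


columns = {"name": {1: "a", 2: "b"}, "age": {2: 30}}
rows = list(rows_from_columns(columns))
```

Each yielded row could then be written to an intermediate file in whatever format the target warehouse bulk loads.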

In other features, for each object of the retrieved objects, the time value denotes at least one of (i) a transaction time corresponding to creation of the retrieved object or (ii) a valid time corresponding to the retrieved object. In other features, a write-optimized store is configured to (i) cache additional objects for later storage in the index store and (ii) in response to the size of the cache reaching a threshold, package the additional objects together for bulk loading into the index store. In other features, the schema inference module is configured to collect statistics on the metadata of the retrieved objects. In other features, the schema inference module is configured to collect statistics on data types of the retrieved objects. In other features, the schema inference module is configured to recast the data of some of the retrieved objects in response to the statistics on data types.
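
The cache-then-bulk-load behavior of the write-optimized store can be sketched as a small buffer. The class below is a minimal illustration, assuming the cache size is measured as an object count and that the index store exposes some bulk-load callable; both are assumptions of this example.

```python
class WriteOptimizedStore:
    """Buffer incoming objects and hand them off in batches (illustrative only)."""

    def __init__(self, threshold, bulk_load):
        self.threshold = threshold  # assumed unit: number of cached objects
        self.bulk_load = bulk_load  # hypothetical callable taking a list of objects
        self.buffer = []

    def add(self, obj):
        self.buffer.append(obj)
        if len(self.buffer) >= self.threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.bulk_load(batch)


loaded = []
store = WriteOptimizedStore(threshold=2, bulk_load=loaded.append)
for obj in [{"n": 1}, {"n": 2}, {"n": 3}]:
    store.add(obj)
```

After the loop, the first two objects have been delivered as one batch and the third is still cached until the threshold is reached again or `flush` is called.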

In other features, the schema inference module is configured to report the data of some of the retrieved objects to a user as potentially being typed incorrectly, in response to the statistics on data types. In other features, the schema inference module is configured to collect statistics on the data of the retrieved objects. In other features, the statistics include at least one of minimum, maximum, average, and standard deviation. In other features, a data collector module is configured to receive relational data from the first data source and to generate the objects for use by the schema inference module. In other features, the data collector module is configured to eventize the relational data by creating (i) a first column indicating the table from which each item of the relational data is retrieved and (ii) a second column indicating a timestamp associated with each item of the relational data.
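
The sketch below shows one way per-field statistics might be collected and used to flag values that are possibly mistyped. The 90% dominance rule, the use of Python type names, and the function names are assumptions chosen for the example; the disclosure does not state a particular threshold or rule.

```python
import statistics
from collections import Counter


def field_statistics(values):
    """Collect type counts plus min, max, mean, and standard deviation."""
    type_counts = Counter(type(v).__name__ for v in values)
    numbers = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    summary = {"types": dict(type_counts)}
    if numbers:
        summary.update(
            min=min(numbers),
            max=max(numbers),
            mean=statistics.fmean(numbers),
            stdev=statistics.pstdev(numbers),
        )
    return summary


def suspicious_types(type_counts, dominant_share=0.9):
    """Return minority types when one type dominates the field (assumed rule)."""
    total = sum(type_counts.values())
    dominant, count = max(type_counts.items(), key=lambda item: item[1])
    if count / total >= dominant_share:
        return [name for name in type_counts if name != dominant]
    return []


stats = field_statistics([1, 2, 3, 4, 5, 6, 7, 8, 9, "10"])
flagged = suspicious_types(stats["types"])
```

A flagged type could either be reported to the user or recast to the dominant type, matching the two alternatives described in the text.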

In other features, a scheduling module is configured to assign processing jobs to the schema inference module and the export module according to predetermined dependency information. In other features, the export module is configured to partition the cumulative schema into multiple tables. In other features, each of the multiple tables includes columns that appear together in the retrieved objects. In other features, the export module is configured to partition the cumulative schema according to the columns found in corresponding groups of the retrieved objects, each group having a different value for an identifier element. In other features, the schema inference module records a source identifier for each of the retrieved objects. In other features, for each object of the retrieved objects, the source identifier includes a unique identifier of the first data source and a position of the object within the first data source.
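
Partitioning by an identifier element can be illustrated by grouping objects on that element's value and collecting the columns seen in each group. The field name `type` and the function name `partition_columns` are hypothetical choices for this sketch.

```python
from collections import defaultdict


def partition_columns(objects, identifier):
    """Group top-level columns by the value of an identifier element."""
    tables = defaultdict(set)
    for obj in objects:
        # Every column other than the identifier goes to that group's table.
        tables[obj.get(identifier)].update(k for k in obj if k != identifier)
    return {value: sorted(cols) for value, cols in tables.items()}


objects = [
    {"type": "click", "url": "/a"},
    {"type": "purchase", "sku": "x", "price": 3},
    {"type": "click", "url": "/b", "referrer": "/a"},
]
tables = partition_columns(objects, "type")
```

Here the cumulative set of columns splits into a `click` table and a `purchase` table, so neither table carries columns that are always empty for its rows.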

A method of operating a data transformation system includes dynamically creating a cumulative schema for objects retrieved from a first data source. Each of the retrieved objects includes (i) data and (ii) metadata describing the data. Dynamically creating the cumulative schema includes, for each object of the retrieved objects, (i) inferring a schema from the object and (ii) selectively updating the cumulative schema to describe the object according to the inferred schema. The method further includes outputting the data of the retrieved objects to a data destination system according to the cumulative schema.

In other features, the data destination system includes a data warehouse. In other features, the data warehouse stores relational data. The method further includes converting the cumulative schema into a relational schema and outputting the data of the retrieved objects to the data warehouse according to the relational schema. The method further includes generating commands for the data warehouse to update the schema of the data warehouse to reflect changes made to the relational schema.

The method further includes creating at least one intermediate file from the data of the retrieved objects according to the relational schema. In other features, the at least one intermediate file has a predefined data warehouse format. The method further includes bulk loading the at least one intermediate file into the data warehouse. The method further includes storing the data from the retrieved objects in columnar form in an index store. The method further includes generating row-based data from the data stored in the index store.

The method further includes creating a time index in the index store that maps time values to identifiers of the retrieved objects. In other features, for each object of the retrieved objects, the time value denotes at least one of (i) a transaction time corresponding to creation of the retrieved object or (ii) a valid time corresponding to the retrieved object.

The method further includes caching additional objects for later storage in the index store and, in response to the size of the cache reaching a threshold, packaging the additional objects together for bulk loading into the index store. The method further includes collecting statistics on the metadata of the retrieved objects. The method further includes collecting statistics on data types of the retrieved objects.

The method further includes recasting the data of some of the retrieved objects in response to the statistics on data types. The method further includes reporting the data of some of the retrieved objects to a user as potentially being typed incorrectly, in response to the statistics on data types. The method further includes collecting statistics on the data of the retrieved objects. In other features, the statistics include at least one of minimum, maximum, average, and standard deviation.

The method further includes receiving relational data from the first data source and generating the objects used by the dynamic creation. The method further includes eventizing the relational data by creating (i) a first column indicating the table from which each item of the relational data is retrieved and (ii) a second column indicating a timestamp associated with each item of the relational data. The method further includes assigning processing of jobs corresponding to the dynamic creation and the export according to predetermined dependency information.
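
Eventizing relational data, as described above, amounts to tagging each row with its source table and a timestamp. The sketch below assumes each row is a dictionary and that one of its existing columns supplies the timestamp; the output column names `_table` and `_timestamp` are hypothetical.

```python
def eventize(table_name, rows, timestamp_column):
    """Turn relational rows into event objects tagged with table and timestamp."""
    for row in rows:
        event = dict(row)
        event["_table"] = table_name                 # first added column: source table
        event["_timestamp"] = row[timestamp_column]  # second added column: timestamp
        yield event


orders = [{"order_id": 7, "updated_at": "2014-03-14T09:00:00Z"}]
events = list(eventize("orders", orders, "updated_at"))
```

Once tagged this way, rows from different tables can flow through the same schema inference path as other semi-structured objects.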

The method further includes partitioning the cumulative schema into multiple tables. In other features, each of the multiple tables includes columns that appear together in the retrieved objects. The method further includes partitioning the cumulative schema according to the columns found in corresponding groups of the retrieved objects, each group having a different value for an identifier element. The method further includes storing a source identifier for each of the retrieved objects. In other features, for each object of the retrieved objects, the source identifier includes a unique identifier of the first data source and a position of the object within the first data source.

A method of operating a data analysis system includes retrieving objects from a data source. Each of the retrieved objects includes (i) data and (ii) metadata describing the data. The method further includes dynamically creating a cumulative schema by, for each object of the retrieved objects, (i) inferring a schema from the object based on the metadata of the object and inferred data types of elements of the data of the object, (ii) creating a unified schema that describes both (a) the object described by the inferred schema and (b) the cumulative set of objects described by the cumulative schema, and (iii) storing the unified schema as the cumulative schema. The method further includes exporting the data of each of the retrieved objects to a data warehouse.

In other features, the method further includes converting the cumulative schema into a relational schema, and the exporting is performed according to the relational schema. In other features, the dynamic creation is performed during a first pass through the retrieved objects, and the exporting is performed during a second pass through the retrieved objects. In other features, the method further includes storing the data of each of the retrieved objects in an index storage service, and the data of each of the retrieved objects is exported from the index storage service to the data warehouse.

In other features, the exporting includes creating, from the index storage service, at least one intermediate file having a predefined data warehouse format, and bulk loading the at least one intermediate file into the data warehouse. In other features, the method further includes converting the cumulative schema into a relational schema, and the at least one intermediate file is created according to the relational schema. In other features, the method further includes receiving a query from a user via a graphical user interface and responding to the query based on at least one of (i) data stored in the index storage service and (ii) results returned from the data warehouse.

In other features, the method further includes passing the query to the data warehouse to obtain the results. In other features, the method further includes displaying initial results to the user via the graphical user interface and iteratively updating the results in the graphical user interface while execution of the query continues. In other features, the method further includes receiving a query from a user via a graphical user interface and responding to the query based on results returned from the data warehouse. In other features, the method further includes receiving a query from a user via a graphical user interface, displaying initial results to the user in the graphical user interface, and iteratively updating the results in the graphical user interface while execution of the query continues. In other features, updating the results in the graphical user interface includes updating the scaling of at least one axis of at least one data chart.
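
Iteratively updating results while a query is still running can be pictured as a generator that yields a running aggregate after each batch, together with the value a chart axis would need to accommodate. This is a minimal sketch: the batch structure, the count aggregate, and the name `progressive_counts` are assumptions for illustration.

```python
def progressive_counts(batches):
    """Yield running counts and the axis maximum after each processed batch."""
    totals = {}
    for batch in batches:
        for key in batch:
            totals[key] = totals.get(key, 0) + 1
        # The chart axis is rescaled to fit the largest value seen so far.
        axis_max = max(totals.values(), default=0)
        yield dict(totals), axis_max


updates = list(progressive_counts([["a", "b"], ["a"], ["a", "c"]]))
```

A user interface would redraw on each yielded update instead of waiting for the final totals.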

In other features, the method further includes displaying the cumulative schema to a user via a graphical user interface, updating the cumulative schema as additional data is retrieved from the data source, and selectively updating the graphical user interface to reflect the updated cumulative schema. In other features, the method further includes visually distinguishing changed items of the updated cumulative schema in the user interface. In other features, the method further includes repeating the retrieving, the dynamic creating, and the exporting in response to new objects becoming available from the data source. In other features, the method further includes, before repeating the exporting, determining whether the cumulative schema has changed since the previous export and, in response to determining that it has changed, sending at least one command to the data warehouse to update the schema of the data warehouse to reflect the changes to the cumulative schema.

The present disclosure further encompasses each of the above method features embodied as instructions stored on a non-transitory computer-readable medium.

The present disclosure will be more fully understood from the detailed description and the accompanying drawings, which show the following:
An example network architecture for a scalable analysis platform for semi-structured data that leverages cloud resources.
An example network architecture for a scalable analysis platform for semi-structured data with a server appliance at the user end.
An example network architecture for a scalable analysis platform using a data warehouse.
A functional block diagram of a server system.
A functional block diagram of an example scalable analysis platform for semi-structured data.
A functional block diagram of an example scalable analysis platform implementing a data warehouse.
A functional block diagram of an example scalable analysis platform implementing a data warehouse and a hybrid query executor.
A functional block diagram of an example user interface implementation.
A functional block diagram of an example query system of a scalable analysis platform for semi-structured data.
A functional block diagram of an example query system using a data warehouse.
A flowchart of an example method of incorporating ingested data.
A flowchart of an example method of inferring a schema.
A flowchart of an example method of merging two schemas.
A flowchart of an example method of collapsing schemas.
A flowchart of an example method of populating indexes with data.
A flowchart of an example method of performing map adornment.
A flowchart of an example method of creating a relational schema from a JSON schema.
Two flowcharts of example data ingestion processes using a data warehouse.
A flowchart of example updating in response to new data when using a data warehouse.
A flowchart of example user interface operation.
Five screenshots of example implementations of a user interface.
A functional block diagram of an example scalable analysis platform that provides multiple data destinations.
A graphical illustration of bulk row export from a column-oriented repository.
Two dependency diagrams for parallelizing components of the extract, transform, and load processes according to the principles of the present disclosure.
In the drawings, reference numbers may be reused to identify similar and/or identical elements.

The present disclosure describes an analysis platform capable of offering a structured query language (SQL)-compliant interface for querying semi-structured data. For purposes of illustration only, semi-structured data is represented in JSON (JavaScript Object Notation) format, but other self-describing semi-structured formats can be used according to the principles of the present disclosure. Source data does not need to be self-describing: the description can be separated from the data, as is the case with protocol buffers. As long as there are rules, heuristics, or wrapper functions for tagging the data, any input data can be converted into objects similar to JSON format.
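
As an illustration of the last point, the following minimal sketch shows a wrapper function that tags delimited records with field names so that each record becomes a JSON-like object. The function name tag_rows, the field names, and the sample records are assumptions made for this example and do not come from the disclosure.

```python
import csv
import io

def tag_rows(text, field_names):
    """Tag each delimited record with the given field names (hypothetical rule)."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=field_names)
    return [dict(row) for row in reader]

# Untagged source data: the description (the field names) lives outside the data.
raw = "Ruth,714\nKoufax,2396\n"
objects = tag_rows(raw, ["lname", "count"])
# Values stay strings here; a fuller rule set could also assign value types.
```

Each resulting dictionary can then be handled like any other JSON object.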

In various implementations of the analysis platform according to the present disclosure, some or all of the following advantages are realized:

Speed
The analysis platform provides fast query response times to support ad hoc, exploratory, and interactive analysis. Users can use this system to quickly discover hidden insights in the data without having to submit a query and return later that day or the next day to view the results. The analysis platform relies on an index store, which stores all ingested data in indexes and thereby allows for fast response times.

Two primary indexes are used, a BigIndex (BI) and an ArrayIndex (AI), which are described in more detail below. These are a cross between path indexes and column-oriented stores. Like column-oriented stores, they allow queries to retrieve data only in the relevant fields, thereby reducing I/O (input/output) demands and improving performance. Unlike column stores, however, these indexes are suitable for complex nested objects and collections with numerous fields. For other access patterns, the analysis platform engine maintains auxiliary indexes, including a ValueIndex (VI), described in more detail below. Like traditional database indexes, the ValueIndex provides fast logarithmic access for specific field values or ranges of values. These indexes significantly reduce the data that must be retrieved to satisfy a query, thereby improving response times.
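
The logarithmic access attributed to the ValueIndex can be pictured with a sorted list and binary search. The sketch below is only an assumption-laden illustration: the class name ValueIndexSketch, its methods, and the in-memory list are hypothetical and do not reflect how the disclosure actually lays out its indexes.

```python
import bisect

class ValueIndexSketch:
    """Hypothetical per-field index of (value, object_id) pairs kept in sorted order."""

    def __init__(self):
        self._entries = []

    def add(self, value, object_id):
        # insort finds the position by binary search; the list insert itself is linear.
        bisect.insort(self._entries, (value, object_id))

    def range(self, low, high):
        """Return the ids of objects whose value v satisfies low <= v <= high."""
        # (low,) sorts before every (low, object_id), so this locates the
        # first entry with value >= low in logarithmic time.
        start = bisect.bisect_left(self._entries, (low,))
        ids = []
        for value, object_id in self._entries[start:]:
            if value > high:
                break
            ids.append(object_id)
        return ids

avg_index = ValueIndexSketch()
avg_index.add(0.342, 1)
avg_index.add(0.250, 2)
avg_index.add(0.300, 3)
matches = avg_index.range(0.29, 0.35)
```

Only the entries inside the requested range are visited, which is the property that lets such an index reduce the data retrieved for a query.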

Dynamic Schema
The analysis platform infers the schema from the data itself, so users do not have to know an expected schema a priori or pre-declare the schema before data can be loaded. Semi-structured data may have varying structure, both over time and across different sources. The engine therefore computes and updates the schema (or structure) from the data dynamically as the data arrives. A relational schema based on this computed schema is presented to users, who can use it to compose queries.

Unlike previous analysis engines, which require programmers to specify the schema of data collections before querying them, the present platform computes (or infers) the underlying schema across all ingested objects. Because of this dynamic schema property, the platform is highly flexible. Applications that generate source data can change the structure as the application evolves. Analysts can aggregate and query data from various periods without needing to specify how the schema varies from period to period. Moreover, there is no need to design and enforce a global schema, which can take months and often requires excluding data that does not fit the schema.

Other analysis systems, such as MapReduce or Pig, which are sometimes described as "schema-free," have two main drawbacks. First, they require users to know the schema in order to query the data, instead of automatically presenting an inferred schema to the user. Second, they parse and interpret objects and their structure for every query. The analysis platform, by contrast, parses and indexes objects at load time. These indexes allow subsequent queries to run much faster, as mentioned above. Previous engines do not automatically infer a precise and concise schema from the underlying data.

SQL
The analysis platform exposes a standard SQL query interface (for example, an interface compliant with ANSI SQL 2003) so that users can leverage existing SQL tools (for example, reporting, visualization, and business intelligence tools) and expertise. As a result, business users familiar with SQL or SQL tools can directly access and query semi-structured data without needing to load a data warehouse. Because traditional SQL-based tools do not handle JSON or other semi-structured data formats, the analysis platform presents a relational view of the computed schema of JSON objects. It presents a normalized view and incorporates optimizations to keep the view manageable in size. Although the relational view may present several tables in the schema, these tables are not necessarily materialized.

To better accommodate representing semi-structured data in tabular form, the analysis platform can automatically identify "map" objects. Maps are objects (or nested objects) in which both the field names and the values can be searched and queried. For example, an object may contain dates as field names and statistics such as page views as values. In the relational view, maps are extracted into separate tables, and the data is pivoted so that keys are in a key column and values are in a value column.
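
The pivot described above can be sketched as follows. The function name pivot_map, the parent identifier, and the sample dates and page-view counts are hypothetical and are used only to show the shape of the resulting rows.

```python
def pivot_map(parent_id, map_object):
    """Turn a map such as {"2012-12-01": 10} into (parent_id, key, value) rows."""
    return [(parent_id, key, value) for key, value in sorted(map_object.items())]

# A map object: dates are the field names, page views are the values.
page_views = {"2012-12-02": 12, "2012-12-01": 10}
rows = pivot_map(7, page_views)
```

Each row carries the key in one column and the value in another, so a query can filter on the date itself instead of needing one column per date.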

Scale and Elasticity
The analysis platform scales to handle large data set sizes. It can automatically and dynamically distribute internal data structures and processing across independent nodes.

The analysis platform is designed and built for virtualized "cloud" environments, including public clouds such as Amazon Web Services and private clouds, such as virtualized server environments administered by the user's organization or offered by third parties such as Rackspace. Various components of Amazon Web Services, including S3 (Simple Storage Service), EC2 (Elastic Compute Cloud), and EBS (Elastic Block Store), can be leveraged. The analysis platform is elastic, meaning it can scale up and down to arbitrary sizes on demand, and it can hibernate by storing its internal data structures on long-term stores such as Amazon S3. The analysis platform also has multi-tenancy and multi-user support.

The analysis platform uses a service-based architecture with four components: the proxy, the metadata service, the query executor, and the storage service. To scale the analysis platform engine to support larger data sets, provide faster responses, and support more users, the execution engine is parallelized and the storage service is partitioned across independent, low-cost server nodes. These nodes can be real servers or virtualized servers in a hosted environment. Because the executor and the storage service are decoupled, they can be scaled independently. This decoupled, scale-out architecture allows users to leverage the on-demand elasticity for storage and computing that a cloud environment like AWS provides.

The storage service is configurable with various partitioning strategies. Moreover, the underlying data structures (indexes and metadata) can be migrated to long-term storage such as Amazon S3 to hibernate the system when it is not in use, thereby decreasing costs.

Synchronization
The analysis platform can be configured to automatically synchronize its contents with, and thereby replicate, the source data from repositories such as HDFS (Hadoop Distributed File System), Amazon S3 (Simple Storage Service), and NoSQL stores such as MongoDB. These sources can be continuously monitored for changes, additions, and updates, so that the analysis platform can ingest the changed data. This allows query results to be relatively up to date.

Schema Inference
The analysis platform takes the following actions in response to data appearing in a source: (1) infer a unified semi-structured (such as JSON) schema from the data, (2) create a relational view for the schema, (3) populate physical indexes with the data, and (4) execute queries that leverage the indexes. Parts or all of actions 1, 2, and 3 may be pipelined so that only a single pass through the data from the data source is needed.

The first action, schema inference, is described first.

Introduction to Semi-Structured Data
JSON is an increasingly popular self-describing, semi-structured data format that is widely used for data exchange on the Internet. JSON is described here for illustration, and the examples below use the JSON format, but the present disclosure is not limited to JSON.

Briefly, a JSON object consists of string fields (or columns) and corresponding values of potentially different types: numbers, strings, arrays, objects, and so on. JSON objects can be nested, and fields can be multi-valued, for example arrays or nested arrays. The specification can be found at http://JSON.org. Additional details can be found in "A JSON Media Type for Describing the Structure and Meaning of JSON Documents," IETF (Internet Engineering Task Force) draft-zyp-json-schema-03, November 22, 2010, available at http://tools.ietf.org/html/draft-zyp-json-schema-03, the entire disclosure of which is hereby incorporated by reference. As used here, JSON is generalized to include JSON variants such as BSON (Binary JSON). Moreover, other semi-structured formats such as XML (Extensible Markup Language), Protobuf, and Thrift can all be converted to JSON. When XML is used, queries may conform to XQuery instead of SQL.

Below is an example JSON object:
{ "player": { "fname": "George", "lname": "Ruth", "nickname": "Babe" },
  "born": "February 6, 1895",
  "avg": 0.342, "HR": 714,
  "teams": [ { "name": "Boston Red Sox", "years": "1914-1919" },
             { "name": "New York Yankees", "years": "1920-1934" },
             { "name": "Boston Braves", "years": "1935" } ] }

The structure of semi-structured objects can vary from object to object. So, in the same baseball data, the following object may be found:
{ "player": { "fname": "Sandy", "lname": "Koufax" },
  "born": "December 30, 1935",
  "ERA": 2.76, "strikeouts": 2396,
  "teams": [ { "name": "Brooklyn / LA Dodgers", "years": "1955-1966" } ] }

A schema describes the possible structures and data types found in a data collection. The schema includes the names of the fields, the types of the corresponding values, and the nesting relationships. Thus, the schema for the above two objects would be:
{ "player": { "fname": string, "lname": string, "nickname": string },
  "born": string, "avg": number, "HR": number,
  "ERA": number, "strikeouts": number,
  "teams": [ { "name": string, "years": string } ] }

Although the above is the notation used throughout this document to illustrate schemas, a more complete specification is JSON Schema, available at http://JSON-schema.org. For example, types in JSON Schema are generally written in quotation marks, as in "string" or "int". For conciseness and readability, the quotation marks are omitted in this disclosure.

Semi-structured objects can alternatively be viewed as trees with fields as nodes and leaves as atomic values. A path in the object or schema is a path in this tree, for example "player.fname" or "teams[].name".
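
The tree view can be made concrete by enumerating the paths of an object. The following is a minimal sketch, assuming the object has already been parsed into Python dictionaries and lists; the function name paths is hypothetical.

```python
def paths(value, prefix=""):
    """Yield the path of every atomic leaf; array levels are marked with []."""
    if isinstance(value, dict):
        for name, child in value.items():
            yield from paths(child, f"{prefix}.{name}" if prefix else name)
    elif isinstance(value, list):
        for item in value:
            yield from paths(item, prefix + "[]")
    else:
        yield prefix  # atomic value: the accumulated prefix is a complete path

obj = {"player": {"fname": "Sandy", "lname": "Koufax"},
       "teams": [{"name": "Brooklyn / LA Dodgers", "years": "1955-1966"}]}
leaf_paths = sorted(set(paths(obj)))
```

Collecting the paths into a set removes the repeats that arise when an array holds several elements with the same fields.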

Iterative Schema Inference
Before users can ask questions of a data set, they need to know the schema, that is, which fields or dimensions are available for querying. In many cases, the analyst is not responsible for generating the data and so does not know what has been recorded and what is available. For example, in the baseball example above, an analyst may not know that the "ERA" field is available only if pitchers have been observed in the collection. The analysis platform therefore computes (or infers) a unified schema from the ingested data and presents a relational view of the schema to assist the analyst in formulating queries.

The analysis platform aims to produce a schema that optimizes both the precision and the conciseness of the schema. Generally, precise means that the schema represents all the structures in the observed or ingested data and does not allow for structures not yet seen. Concise means that the schema is small enough to be read and interpreted by a human.

The general approach to creating the schema dynamically is to start with a "current" schema inferred from past objects and to grow the schema as new objects are ingested. The current schema (S_curr) is simply merged with the schema (type) of a new object (O_new) to arrive at the new schema (S_new):
S_new = merge(S_curr, type(O_new))

Roughly speaking, the merge process takes the union of the two schemas, collapsing common fields, sub-objects, and arrays, and adding new ones when they appear. This is discussed in more detail below.

Objects
Some of the following examples use data that resembles the output of a data stream from Twitter, referred to as the firehose. The Twitter firehose delivers a stream (an unending sequence) of JSON objects that represent the tweets "tweeted" and metadata about those tweets (for example, user, location, and topics). These tweets are analogous to many other types of event log data, such as data generated by modern web frameworks (for example, Ruby on Rails), mobile applications, sensors, and devices (electricity meters, thermostats). Although similar to Twitter data, the following examples diverge from actual Twitter data for purposes of explanation.

Basic JSON objects are straightforward to deal with: the types seen in the object are simply inferred. For instance, consider the following object:
{ "created_at": "Thu Nov 08", "id": 266353834,
  "source": "Twitter for iPhone",
  "text": "@ilstavrachi: would love dinner. Cook this: http://bit.ly/955Ffo",
  "user": { "id": 29471497, "screen_name": "Mashah08" },
  "favorited": false }

The schema inferred from that object would be:
{ "created_at": string, "id": number, "source": string, "text": string,
  "user": { "id": number, "screen_name": string },
  "favorited": boolean }
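
The inference for a basic object can be sketched as a recursive walk over the parsed JSON. This is a minimal sketch under the assumption that the object contains no arrays; the function name infer_type and the use of plain strings for type names are choices made for this example.

```python
import json

def infer_type(value):
    """Return the schema of a JSON value that contains no arrays."""
    if isinstance(value, bool):      # test bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, dict):      # recurse into nested objects
        return {name: infer_type(child) for name, child in value.items()}
    raise TypeError("arrays are not handled in this sketch")

tweet = json.loads("""
{ "created_at": "Thu Nov 08", "id": 266353834,
  "source": "Twitter for iPhone",
  "text": "@ilstavrachi: would love dinner. Cook this: http://bit.ly/955Ffo",
  "user": { "id": 29471497, "screen_name": "Mashah08" },
  "favorited": false }
""")
schema = infer_type(tweet)
```

The resulting dictionary mirrors the object's nesting, with each atomic value replaced by the name of its type.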

As new objects arrive, new fields can be added by taking the union of the sets of fields. Sometimes a field is repeated but with a different type, a condition called type polymorphism. The schema represents type polymorphism with multiple attributes that share the same key.

Log formats often change, and developers may add new fields or change the type of a field. As a concrete example, consider the "id" field that identifies a tweet. The "id" field was originally a number. As the number of tweets grew, however, certain programming languages could not handle such large numbers, and the "id" field was changed to a string. A new record of that form would therefore be:
{ "created_at": "Thu Nov 10", "id": "266353840",
  "source": "Twitter for iPhone",
  "text": "@binkert: come with me to @ilstavrachi place",
  "user": { "id": 29471497, "screen_name": "Mashah08" },
  "retweet_count": 0 }

Because a string "id" has now been seen and a new field "retweet_count" has appeared, the schema is augmented as follows:
{ "created_at": string, "id": number, "id": string,
  "source": string, "text": string,
  "user": { "id": number, "screen_name": string },
  "retweet_count": number }

Note that "id" appears twice, once as a string and once as a number. Sometimes the structure of nested objects changes. For example, suppose more profile information is added for the user:
{ "created_at": "Thu Nov 10", "id": "266353875",
  "source": "Twitter for iPhone",
  "text": "@binkert: come with me to @ilstavrachi place",
  "user": { "id": "29471755", "screen_name": "mashah08",
            "location": "Saratoga, CA", "followers_count": 22 },
  "retweet_count": 0 }

In that case, the platform recursively merges the nested schema of "user" to obtain the following schema:
{ "created_at": string, "id": number, "id": string,
  "source": string, "text": string,
  "user": { "id": number, "id": string, "screen_name": string,
            "location": string, "followers_count": number },
  "retweet_count": number }

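The recursive merge with type polymorphism, as walked through in the tweet examples above, can be sketched as follows. This is only one possible representation, chosen for the example: each field maps to a list of alternative types, so a field seen as both number and string keeps both entries. The names type_of, merge_alternatives, and merge are hypothetical, and null values and arrays are deliberately left out.

```python
def type_of(value):
    """Schema of one value: a scalar type name, or {field: [alternative types]}."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {name: [type_of(child)] for name, child in value.items()}
    raise TypeError("null and arrays are left out of this sketch")

def merge_alternatives(left, right):
    """Union two lists of alternatives; nested object schemas merge recursively."""
    merged = []
    for t in left + right:
        if isinstance(t, dict):
            existing = next((m for m in merged if isinstance(m, dict)), None)
            if existing is None:
                merged.append(t)
            else:
                merged[merged.index(existing)] = merge(existing, t)
        elif t not in merged:
            merged.append(t)
    return merged

def merge(s_curr, s_obj):
    """S_new = merge(S_curr, type(O_new)) for object schemas."""
    names = list(s_curr) + [n for n in s_obj if n not in s_curr]
    return {n: merge_alternatives(s_curr.get(n, []), s_obj.get(n, [])) for n in names}

tweets = [
    {"id": 266353834, "user": {"id": 29471497, "screen_name": "Mashah08"}},
    {"id": "266353840", "user": {"id": 29471497, "screen_name": "Mashah08"},
     "retweet_count": 0},
    {"id": "266353875", "user": {"id": "29471755", "screen_name": "mashah08",
                                 "location": "Saratoga, CA", "followers_count": 22},
     "retweet_count": 0},
]
schema = {}
for tweet in tweets:
    schema = merge(schema, type_of(tweet))
```

Folding the three abbreviated tweets through merge leaves "id" with both number and string at the top level and inside "user", which corresponds to the repeated "id" keys in the schema notation used in this document.
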
Null Fields and Empty Objects
Empty objects or null fields can be present in JSON records. For example, the record for a person's coordinates (latitude and longitude) could be:
{ "coordinates": {} }
The schema has the identical type:
{ "coordinates": {} }
Strictly speaking, {} is termed an instance, and the type is object. The examples and explanations in this disclosure vary from strict JSON for ease of explanation.

Similarly, the following object
{ "geo": null }
has the identical type:
{ "geo": null }

If a subsequent record has a value for the object, the empty object is filled in by applying the merge. For instance, the records
{ "coordinates": {} }
{ "coordinates": { "type": "Point" } }
produce the following schema:
{ "coordinates": { "type": string } }

A null type is similarly replaced by an observed type. For example, the records
{ "geo": null }
{ "geo": true }
produce the following schema:
{ "geo": boolean }

Arrays
Tweets often contain items such as hashtags (highlighted topic words), URLs, and mentions of other Twitter users. The Twitter firehose may, for example, parse and extract these items automatically for inclusion in the tweet's JSON object. The following examples use hashtag metadata to illustrate how the schema for arrays is inferred.

First, consider extracting and recording a list of the starting offsets of the hashtags in the following tweet (or string):
"#donuts #muffins #biscuits"
These offsets may be represented by the following array:
{ "offsets": [0, 8, 17] }
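The offset extraction described above can be illustrated with a minimal sketch. The function name hashtag_offsets and the regular expression are assumptions made for illustration and are not part of the disclosure.

```python
import re

def hashtag_offsets(text):
    """Return the start offset of every '#word' token in text."""
    return [m.start() for m in re.finditer(r"#\w+", text)]

# Build the record shown above from the example string.
record = {"offsets": hashtag_offsets("#donuts #muffins #biscuits")}
print(record)
```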

Arrays in the source data are represented in the schema as an array containing the types of the elements found in the source array, in no particular order. The schema for the above object is therefore:
{ "offsets": [number] }

To include the hashtags together with the offsets for later processing, the tweet object may list both the hashtags and the offsets in the array, as follows:
{ "tags": [0, "donuts", 8, "muffins", 17, "biscuits"] }
The corresponding schema includes both types in the array:
{ "tags": [ number, string ] }

Alternatively, the tags and offsets may be listed in the reverse order:
{ "tags": ["donuts", 0, "muffins", 8, "biscuits", 17] }
Because the "tags" array can contain either strings or numbers, the resulting schema is:
{ "tags": [ string, number ] }

In fact, the tag text and the tag offsets can be included in adjacent objects:
{ "tags": ["donuts", "muffins", "biscuits"] },
{ "tags": [0, 8, 17] }
There are now two schemas for "tags":
{ "tags": [string] } and { "tags": [number] }
In this case, the arrays are at the same depth, so they can be merged to produce the same schema as above:
{ "tags": [ string, number ] }

Note also that the following schemas are identical:
{ "tags": [string, number] }
{ "tags": [number, string] }
This is because the list of types is treated as a set. Types for array elements are merged where possible, and merging is further performed for objects and arrays inside arrays. In various other implementations, the order of types and the dependencies among types (in both arrays and objects) can be preserved. However, this makes the schema less concise.
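The set semantics described above can be sketched as follows for flat arrays of scalars. The helper names scalar_type and array_schema, and the use of a Python frozenset to stand for the schema's type list, are illustrative assumptions.

```python
def scalar_type(v):
    """Map a parsed JSON scalar to its schema type name."""
    # bool is tested before int because bool is a subclass of int in Python.
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "boolean"
    if isinstance(v, (int, float)):
        return "number"
    if isinstance(v, str):
        return "string"
    raise TypeError(f"not a scalar: {v!r}")

def array_schema(values):
    """Schema of a flat array: the set of element types, order ignored."""
    return frozenset(scalar_type(v) for v in values)

a = array_schema([0, "donuts", 8, "muffins", 17, "biscuits"])
b = array_schema(["donuts", 0, "muffins", 8, "biscuits", 17])
# Merging the schemas of two adjacent records is a set union.
c = array_schema(["donuts", "muffins", "biscuits"]) | array_schema([0, 8, 17])
print(a == b == c)
```

All three inputs yield the same schema because neither element order nor repetition affects the set of types.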

Nested Objects
To illustrate nested objects, suppose that both the beginning and the ending offsets are recorded as follows:
{ "tags": [{ "text": "donuts", "begin": 0 }, { "text": "donuts",
"end": 6 }]}
The resulting schema is:
{ "tags": [{"text": string, "begin": number,
"end": number }] }
As shown, the object types are merged instead of typing the array elements separately.
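A minimal sketch of this merging of object types inside an array is given below. It assumes flat objects whose values are booleans, numbers, or strings only; the function name merge_object_types and the use of Python sets for the per-field types are illustrative assumptions.

```python
def merge_object_types(objects):
    """Merge a list of flat JSON objects into a single object type.

    Returns a dict mapping each field name to the set of scalar type
    names observed for that field across all of the objects.
    """
    names = {bool: "boolean", int: "number", float: "number", str: "string"}
    merged = {}
    for obj in objects:
        for key, value in obj.items():
            merged.setdefault(key, set()).add(names[type(value)])
    return merged

tags = [{"text": "donuts", "begin": 0}, {"text": "donuts", "end": 6}]
# The array holds one merged object type rather than one type per element.
schema = {"tags": [merge_object_types(tags)]}
print(schema)
```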

Similarly, where the tag strings and offsets are in nested arrays:
{ "tags": [ [ "donuts", "muffins" ], [0, 8] ] } ==>
{ "tags": [[string], [number]] },
the schema is further reduced to:
{ "tags": [[string, number]] }
This is a trade-off, made in various implementations of the present disclosure, between the precision of the schema and its conciseness.

Empty objects and empty arrays are treated as follows. Because empty objects are filled in as described above, the following example schema reduction is possible:
{ "parsed": { "tag": {}, "tag": { "offset": number } } }
=> { "parsed": { "tag": { "offset": number } } }
Similarly, using the merge rules for arrays, the following schema reductions are made:
{ "tags": [[], [ number ]] } => { "tags": [[ number ]] }
{ "tags": [[], [[]]] } => { "tags": [[[]]] }
{ "tags": [[], [[]], [number]] } => { "tags": [[[]], [number]] }
=> { "tags": [[[], number]] }

Merge Procedure
To create a new schema from a previous schema and a new object, the analysis platform first types (that is, computes the schema for) the new object. This procedure is intended to specify the canonical semantics for typing, not to describe any particular implementation. In the following description, the variables v, w, v_j, w_j range over any valid JSON value, while j, k, j_m, k_n range over valid strings. The basic rules for typing are:
type(scalar v) = scalar_type of v
type({ k_1: v_1, ..., k_n: v_n }) =
  collapse({ k_1: type(v_1), ..., k_n: type(v_n) })
type([ v_1, ..., v_n ]) =
  collapse([ type(v_1), ..., type(v_n) ])

The first rule simply states that for a scalar, such as 3 or "a", the corresponding type is inferred directly from the value itself (number for 3, string for "a"). The second and third rules recursively type objects and arrays using the collapse function.

The collapse function repeatedly merges the types of the same field in objects, and merges objects, arrays, and common types inside arrays. Merging continues recursively until scalar types are reached. For objects, the collapse function is:
collapse({ k_1: v_1, ..., k_n: v_n }):
  while k_i == k_j:
    if v_i, v_j are scalar types and v_i == v_j OR
       v_i, v_j are objects OR v_i, v_j are arrays:
      replace {..., k_i: v_i, ..., k_j: v_j, ...}
      with {..., k_i: merge(v_i, v_j), ...}

For arrays, the collapse function is:
collapse([ v_1, ..., v_n ]):
  while v_i, v_j are scalar types and v_i == v_j OR
        v_i, v_j are objects OR v_i, v_j are arrays:
    replace [..., v_i, ..., v_j, ...]
    with [..., merge(v_i, v_j), ...]

The merge function describes how to combine values pairwise, removing duplicates and combining arrays and maps. For objects, merge simply calls collapse recursively to collapse common fields:
merge(v, v) = v
merge({}, { k_1: v_1, ..., k_n: v_n }) = { k_1: v_1, ..., k_n: v_n }
merge({ j_1: v_1, ..., j_n: v_n }, { k_1: w_1, ..., k_m: w_m })
  = collapse({ j_1: v_1, ..., j_n: v_n, k_1: w_1, ..., k_m: w_m })

Similarly, for arrays:
merge([], [v_1, ..., v_n]) = [v_1, ..., v_n]
merge([v_1, ..., v_n], [w_1, ..., w_m])
  = collapse([v_1, ..., v_n, w_1, ..., w_m])
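The type, collapse, and merge rules above can be sketched in executable form as follows. This is one possible reading of the rules, not a description of any particular implementation. The representation is an assumption: an object type is a dict mapping each field name to the list of types seen for that field (so {"id": ["number", "string"]} stands for { "id": number, "id": string }), and an array type is the hypothetical class Arr. Null is kept as its own type; the replacement of a null type by an observed type is not modeled here.

```python
from dataclasses import dataclass, field

@dataclass
class Arr:
    """Array type: the collapsed list of its element types."""
    elems: list = field(default_factory=list)

def collapse(types):
    """Merge a list of types: equal scalars, all objects, all arrays."""
    out = []
    for t in types:
        for i, u in enumerate(out):
            if isinstance(u, dict) and isinstance(t, dict):
                out[i] = merge(u, t)   # objects always merge
                break
            if isinstance(u, Arr) and isinstance(t, Arr):
                out[i] = merge(u, t)   # arrays always merge
                break
            if u == t:                 # identical scalar type: drop duplicate
                break
        else:
            out.append(t)              # a type not seen before
    return out

def merge(a, b):
    """Merge two object types, or two array types, into one."""
    if isinstance(a, Arr):
        return Arr(collapse(a.elems + b.elems))
    fields = {k: list(v) for k, v in a.items()}
    for k, v in b.items():
        fields[k] = collapse(fields.get(k, []) + v)
    return fields

def type_of(v):
    """Compute the schema (type) of one parsed JSON value."""
    if v is None:
        return "null"
    if isinstance(v, bool):            # before int: bool subclasses int
        return "boolean"
    if isinstance(v, (int, float)):
        return "number"
    if isinstance(v, str):
        return "string"
    if isinstance(v, list):
        return Arr(collapse([type_of(x) for x in v]))
    return {k: [type_of(x)] for k, x in v.items()}

# Two adjacent records whose "tags" arrays hold different element types.
s = merge(type_of({"tags": ["donuts", "muffins"]}), type_of({"tags": [0, 8]}))
print(s)
```

Under this representation, the merged result corresponds to { "tags": [string, number] }, and typing a record with nested arrays such as [["donuts", "muffins"], [0, 8]] yields the reduced form [[string, number]].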

Nulls are preserved, as shown here:
merge({ "coordinates": {} }, { "coordinates": null },
{ "coordinates": [] })
= { "coordinates": {}, "coordinates": [], "coordinates": null }
A JSON null is a value, just as the number 9 is a value. In a relation, NULL indicates that no value was specified. In SQL, nulls are presented as tags<null>: boolean, where the Boolean value is True if the null exists and NULL otherwise. To simplify the schema for SQL users, the coordinates<null> column can be omitted if the user does not need to distinguish JSON nulls from SQL nulls.

Cumulative Example
With the simple rules above, deeply nested JSON records can be typed. For example, consider a complex hypothetical record that represents page view statistics for a web page:
{ "stat": [ 10, "total_pageviews", { "counts": [1, [3]],
"page_attr": 7.0 }, { "page_attr": ["internal"]} ]}
The following schema is produced:
{ "stat": [number,
string,
{ "counts": [number, [number]],
"page_attr": number,
"page_attr": [string]
}]}

In various implementations, the JSON Schema format can be used to encode the inferred schema. This format is standardized and can easily be extended to incorporate additional metadata (for example, whether an object is a map). However, it is quite verbose and space-inefficient, so it is not used for the examples in this disclosure. For instance, in JSON Schema format, the above schema is represented as follows:
{
  "type": "object",
  "properties": {
    "stat": {
      "items": {
        "type": [
          "number",
          "string",
          {
            "type": "object",
            "properties": {
              "counts": {
                "items": {
                  "type": [
                    "number",
                    {
                      "items": {
                        "type": "number"
                      },
                      "type": "array"
                    }
                  ]
                },
                "type": "array"
              },
              "page_attr": {
                "type": [
                  "number",
                  {
                    "items": {
                      "type": "string"
                    },
                    "type": "array"
                  }
                ]
              }
            }
          }
        ]
      },
      "type": "array"
    }
  }
}

Map Decoration
Developers and analysts can use JSON objects and arrays for many different purposes. In particular, JSON objects are frequently used both as objects and as "maps". For example, a developer might create an object whose fields are dates and whose values are collected statistics such as page views. In another example, the fields are user ids and the values are profiles. In these cases, the object is closer to a map data structure than to a static object. Because there can be so many field names, and because field names are created dynamically, a user does not necessarily know the possible field names. As a result, users may want to query fields in the same way they would query values.

To support this use, the analysis platform is able to identify maps. The analysis platform incorporates heuristics to identify maps, and also allows users to specify which nested objects should and should not be treated as maps. Tagging an object as a map is called decoration.

In general, decoration is performed after the initial load. That is, maps need not be identified on the initial ingest. Decoration can be performed later, on a second pass or after more data has been ingested. In addition, maps can simply be reverted to objects if needed.

By default, JSON objects are treated as objects (or structs, in C terminology). This can be explicitly indicated in the JSON Schema by annotating an object with "obj_type": object. The shorthand notation used in the examples below is O{}.

To flag maps, the heuristic looks for fields that, as a group, occur relatively infrequently compared to the object (container) that contains them. The shorthand notation M{} is used for maps.

While the schema is computed on the first pass, the frequency with which fields occur is tracked. Consider an object (or nested object) that occurs with frequency F in the data set. Let v_i be the frequency of field i in the object, and let N be the number of unique fields of the object (irrespective of type). The ratio (sum(v_i) / N) / F is the ratio of the average field frequency to the frequency of the container. If this ratio is below a threshold, such as 0.01, which may be user-configurable, the containing object is designated as a map. In various implementations, empty objects in the JSON schema are treated as maps.
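The ratio test above can be sketched as follows. The function name looks_like_map, its parameter names, and the example frequencies are illustrative assumptions; the sketch also assumes the container frequency F is greater than zero.

```python
def looks_like_map(container_freq, field_freqs, threshold=0.01):
    """Apply the test (sum(v_i) / N) / F < threshold.

    container_freq: F, how often the containing object occurs.
    field_freqs:    dict mapping field name -> v_i, that field's frequency.
    An object with no fields is treated as a map.
    """
    if not field_freqs:
        return True
    n = len(field_freqs)                      # N: number of unique fields
    average = sum(field_freqs.values()) / n   # average field frequency
    return average / container_freq < threshold

# 1000 containers, each holding a single field named after a distinct user id.
per_user = {f"user{i}": 1 for i in range(1000)}
print(looks_like_map(1000, per_user))                     # map-like
# 1000 containers that nearly always carry the same two fields.
print(looks_like_map(1000, {"id": 1000, "text": 990}))    # object-like
```

In the first case the ratio is 0.001, below the example threshold of 0.01; in the second it is 0.995.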

Creating the Relational Schema
After the schema of the JSON objects in the source data set has been inferred, the analysis platform produces a relational schema that can be exposed to SQL users and SQL-based tools. The goal is to create a concise schema that represents the containment relationships in the JSON schema while giving users the power of standard SQL. This relational schema is produced from the decorated JSON schema and is a view over the underlying semi-structured data set. Before a generalized procedure for performing the conversion is described, several examples of how a JSON schema is converted to a relational view are presented.

Objects
The simplest example is an object with simple scalar types, such as the following schema:
{ "created_at": string, "id": number, "text": string,
"source": string, "favorited": boolean }
In this case, the fields of the object translate directly into the columns of a relation:
Root(created_at: str, id: num, text: str, source: str, favorited: bool)

The relation (or table) for the top-level object is called "Root" here, although it can be replaced by, for example, the name of the source collection, if one exists. For brevity and readability, the type names string, number, and boolean are shortened to str, num, and bool.

To support type polymorphism, the type can be added to the attribute name. For example, consider the following schema:
{ "created_at": string, "id": number, "id": string, "text":
string, "source": string, "favorited": boolean }
The resulting relational schema has separate "id<num>" and "id<str>" columns:
Root(created_at: str, id<num>: num, id<str>: str,
source: str, text: str, favorited: bool)
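The naming convention above can be sketched as follows. The schema is represented, as an assumption, by a list of (field, type) pairs so that a field name may repeat; the function name root_columns is hypothetical.

```python
ABBREV = {"string": "str", "number": "num", "boolean": "bool"}

def root_columns(schema):
    """Render a flat object schema as a Root(...) relation.

    schema: list of (field name, scalar type) pairs. A field that occurs
    with more than one type gets the type appended to its column name.
    """
    counts = {}
    for name, _ in schema:
        counts[name] = counts.get(name, 0) + 1
    cols = []
    for name, typ in schema:
        short = ABBREV[typ]
        col = f"{name}<{short}>" if counts[name] > 1 else name
        cols.append(f"{col}: {short}")
    return "Root(" + ", ".join(cols) + ")"

schema = [("created_at", "string"), ("id", "number"), ("id", "string"),
          ("text", "string"), ("source", "string"), ("favorited", "boolean")]
print(root_columns(schema))
```

Columns are emitted in the order the fields are listed, which is a choice of this sketch rather than a requirement stated in the text.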

Nested Objects
Nested objects produce new relations with foreign-key relationships. For example, consider the JSON schema:
{ "created_at": string, "id": number, "source": string, "text":
string,
"user": { "id": number, "screen_name": string },
"favorited": boolean }
The corresponding relational schema is:
Root(created_at: str, id: num, source: str, text: str, favorited:
bool, user: join_key)
Root.user(id_jk: join_key, id: num, screen_name: str)

The nested object is "normalized" into a separate relation named by its path, "Root.user" in this case. The column "Root.user"."id_jk" in the new table that represents the sub-object is a foreign key for the column "Root.user" (the "user" column in the table "Root"). The type is specified as "join_key" to distinguish it from other columns, but in actual implementations the join_key type is typically an integer.

Objects can be nested several levels deep. For example, a retweet object may include a retweeted status object, which includes the profile of the user that retweeted, resulting in the following schema:
{ "created_at": string, "id": number, "source": string, "text":
string,
"user": { "id": number, "screen_name": string },
"retweeted_status": { "created_at": string, "id": number,
"user": { "id": number, "screen_name": string } },
"favorited": boolean }
The corresponding relational view is:
Root(created_at: str, id: num, source: str,
text: str, favorited: bool,
user: join_key, retweeted_status: join_key)
Root.user(id_jk: join_key, id: num, screen_name: str)
Root.retweeted_status(id_jk: join_key, created_at: str, id: num,
user: join_key)
Root.retweeted_status.user(id_jk: join_key, id: num, screen_name:
str)
Note that "Root.user", "Root.retweeted_status", and "Root.retweeted_status.user" are all separated into different tables.
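The normalization of nested objects into separate relations can be sketched as follows. The function name relations and the dict-based schema representation (field name mapped to a scalar type name or to a nested dict) are assumptions for illustration; columns are emitted in field order rather than with the join-key columns last.

```python
ABBREV = {"string": "str", "number": "num", "boolean": "bool"}

def relations(schema, name="Root", nested=False):
    """Return {table name: [column definitions]} for a nested object schema."""
    # Every table except the top-level one starts with its join-key column.
    cols = ["id_jk: join_key"] if nested else []
    tables = {name: cols}
    for field, typ in schema.items():
        if isinstance(typ, dict):
            # The parent keeps a join key; the child becomes its own table,
            # named by its path.
            cols.append(f"{field}: join_key")
            tables.update(relations(typ, f"{name}.{field}", nested=True))
        else:
            cols.append(f"{field}: {ABBREV[typ]}")
    return tables

user = {"id": "number", "screen_name": "string"}
schema = {"id": "number", "user": user,
          "retweeted_status": {"created_at": "string", "id": "number",
                               "user": user}}
for table, columns in relations(schema).items():
    print(f"{table}({', '.join(columns)})")
```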

Optimizing 1-to-1 Relationships
In nested object relationships, there is often a 1-to-1 relationship from rows in the main table to rows in the table for the nested object. As a result, these can be collapsed 1-to-1 into a single table, using dotted notation for the column names.

For example, the multi-relation example above flattens into:
Root(created_at: str, id: num, source: str,
text: str, favorited: bool,
user.id: num, user.screen_name: str)
and, for the three-level nested object example:
Root(created_at: str, id: num, source: str,
text: str, favorited: bool,
user.id: num, user.screen_name: str,
retweeted_status.created_at: str,
retweeted_status.id: num,
retweeted_status.user.id: num,
retweeted_status.user.screen_name: str)
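The flattening into dotted column names can be sketched as a short recursive function. The name flatten and the dict-based schema representation are assumptions for illustration, and the sketch presumes that every nested object is in a 1-to-1 relationship with its parent.

```python
def flatten(schema, prefix=""):
    """Collapse nested object types into dotted column names."""
    cols = {}
    for field, typ in schema.items():
        path = f"{prefix}{field}"
        if isinstance(typ, dict):
            cols.update(flatten(typ, path + "."))   # descend, extending the path
        else:
            cols[path] = typ
    return cols

schema = {"id": "num",
          "user": {"id": "num", "screen_name": "str"},
          "retweeted_status": {"id": "num",
                               "user": {"id": "num", "screen_name": "str"}}}
for column, typ in flatten(schema).items():
    print(f"{column}: {typ}")
```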

Note that, because the relational schema is simply a view over the JSON schema, flattened, partially flattened, or separate (un-flattened) relational schemas can be presented to the user as desired by the analysis platform without modifying the underlying data. The only limitation is that the user not be presented with conflicting table definitions.

Maps
Without designating a set of fields as a map, the corresponding relational schema may include a huge number of columns. In addition, the user may want to query the field names. For example, a user may want to find the average page views in December.

To solve these problems, the tables for (nested) objects that are decorated as maps can be "pivoted". For example, consider the following schema for keeping track of various metrics (daily page views, clicks, time spent, and so on) for each page on a web site:
O{ "page_url": string, "page_id": number,
"stat_name": string,
"metric": M{ "2012-01-01": number, "2012-01-02": number, ...,
"2012-12-01": number, ... } }

Rather than producing a table with a separate column for each day, the fields and values can be stored as key-value pairs in a relation:
Root(page_url: str, page_id: num, stat_name: str, metric<map>:
join_key)
Root.metric<map>(id_jk: join_key, key: string, val: num)

In this case, the id column (id_jk) is a foreign key indicating the record within which each map entry was originally present. For a year's worth of page views, instead of having 365 columns, the table "Root.metric" has only two: the "key" column stores the field names and the "val" column stores the values. For example, for the above schema, the database may contain these records for "www.noudata.com/jobs" (page_id 284):
Root("www.noudata.com/jobs", 284, "page_views", 3),
Root.metric<map>(3, "2012-12-01", 50),
Root.metric<map>(3, "2012-12-02", 30), ...
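Pivoting a map into key-value rows, and then querying on the key column, can be sketched as follows. The function name pivot_map and the tuple representation of rows are assumptions for illustration; the join key value 3 is taken from the example records above.

```python
def pivot_map(join_key, mapping):
    """Produce one (id_jk, key, val) row per map entry."""
    return [(join_key, key, val) for key, val in mapping.items()]

record = {"page_url": "www.noudata.com/jobs", "page_id": 284,
          "stat_name": "page_views",
          "metric": {"2012-12-01": 50, "2012-12-02": 30}}

join_key = 3
root_row = (record["page_url"], record["page_id"], record["stat_name"], join_key)
metric_rows = pivot_map(join_key, record["metric"])

# Because field names are now data, a predicate can be applied to them,
# for example to average the page views recorded in December 2012.
december = [val for _, key, val in metric_rows if key.startswith("2012-12")]
average = sum(december) / len(december)
print(root_row, metric_rows, average)
```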

Pivoting still works if there is type polymorphism in the map. For example, suppose the metric represents sentiment, which contains both a category and a score indicating the strength of the category:
{ "page_url": "www.noudata.com/blog", "page_id": 285,
"stat_name": "sentiment",
"metric": { "2012-12-01": "agreement", "2012-12-01": 5,
"2012-12-05": "anger", "2012-12-05": 2, ... } }
The JSON schema is:
O{ "page_url": string, "page_id": number,
"stat_name": string,
"metric": M{ "2012-12-01": string, "2012-12-01": number, ...,
"2012-12-05": string, "2012-12-05": number, ... } }

When the relational schema is created, a new "val" column can be added to the map relation to hold the new type. As shown below, the other "val" columns can likewise have their types appended to distinguish the column names:
Root(page_url: str, page_id: num, stat_name: str, metric<map>:
join_key)
Root.metric<map>(id_jk: join_key, key: string,
val<str>: str, val<num>: num)

The entries resulting from the above JSON object are as follows:
Root.metric<map>(4, "2012-12-01", "agreement", NULL),
Root.metric<map>(4, "2012-12-01", NULL, 5),
Root.metric<map>(4, "2012-12-05", "anger", NULL),
Root.metric<map>(4, "2012-12-05", NULL, 2) ...
Once these maps are pivoted, users can apply predicates and functions to the key column as they would to any other column.
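The pivot with type polymorphism can be sketched as follows. Because a Python dict cannot hold the same key twice, the map entries are supplied, as an assumption, as a list of (key, value) pairs; the function name pivot_entries is hypothetical, and None stands in for SQL NULL.

```python
def pivot_entries(join_key, entries):
    """Pivot (key, value) pairs into (id_jk, key, val<str>, val<num>) rows.

    The value column that does not match the value's type holds None.
    """
    rows = []
    for key, value in entries:
        if isinstance(value, str):
            rows.append((join_key, key, value, None))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            rows.append((join_key, key, None, value))
        else:
            raise TypeError(f"unsupported map value: {value!r}")
    return rows

entries = [("2012-12-01", "agreement"), ("2012-12-01", 5),
           ("2012-12-05", "anger"), ("2012-12-05", 2)]
for row in pivot_entries(4, entries):
    print(row)
```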

Nested Maps
The basic principles are the same for nested maps. Consider a list of statistics per day and per hour:
M{"2012-12-01": M{ "12:00": number,
"01:00": number,
"02:00": number,
... },
"2012-12-02": M{ ... },
... }
The resulting schema is:
Root(id_jk: join_key, key: string, val<map>: join_key)
Root.val<map>(id_jk: join_key, key: string, val<num>: num)

Objects can also be nested inside maps:
M{ "2012-12-01": O{ "sentiment": string,
"strength": number },
"2012-12-02": O{ ... },
... }
The resulting flattened relational schema is:
Root(id_jk: join_key, key: string, val<map>: join_key)
Root.val<map>(id_jk: join_key, sentiment: string,
strength: number)

Empty Elements
Empty objects sometimes appear in the data. Consider the schema:
{ "created_at": string, "id": number, "source": string, "text":
string,
"user": { "id": number, "screen_name": string } }
A JSON object may be received without user information, as shown here:
{ "created_at": "Thu Nov 08",
"id": 266353834,
"source": "Twitter for iPhone",
"text": "@ilstavrachi: would love dinner. Cook this: http://bit.ly/955Ffo",
"user": { } }

The empty user object can be represented with the following relational tuples:
Root("Thu Nov 08", 266353834, "Twitter for iPhone",
"@ilstavrachi: would love dinner. Cook this: http://bit.ly/955Ffo", join_key)
Root.user(join_key, NULL, NULL)

If all ingested user objects had an empty object in the ingested stream, the resulting JSON schema would include an empty object. For example, see the final field ("user") in this schema:
{"id": number, "user": {}}
In this case, the empty object "user" can be treated as a map, resulting in the following relational schema:
Root(id: num, user<map>: join_key)
Root.user<map>(id_jk: join_key, key: string)

Note that Root.user<map> has no value columns and is initially empty. However, this design makes it straightforward to add columns later if the schema changes as new objects are ingested, because each record in Root will already have been assigned a join key.

Arrays
Arrays are treated similarly to maps, so the schema translation is quite similar. The main difference is that the string "key" field of a map is replaced by an "index" field of type integer (int) corresponding to the array index. A simple example is
{ "tags": [ string ] }
which leads to the following relational schema:
Root(tags<arr>: join_key)
Root.tags<arr>(id_jk: join_key, index: int, val<str>: str)

Type polymorphism and nested arrays work the same way as for maps. Consider the following schema:
{ "tags": [ number, string ] }
This leads to the following relational schema:
Root(tags<arr>: join_key)
Root.tags<arr>(id_jk: join_key, index: int,
val<num>: num, val<str>: str)
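
As a minimal sketch of how one polymorphic array could be turned into rows of the Root.tags<arr> table above, the following Python function fills exactly one typed value column per element. The function name `array_to_rows` and the dictionary row layout are illustrative assumptions, not part of the disclosure.

```python
def array_to_rows(join_key, items):
    """Turn one JSON array of scalars into rows shaped like Root.tags<arr>.

    Hypothetical helper: each row has id_jk, index, and one column per
    observed value type; the column that does not apply stays None (NULL).
    """
    rows = []
    for index, item in enumerate(items):
        row = {"id_jk": join_key, "index": index,
               "val<num>": None, "val<str>": None}
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(item, (int, float)) and not isinstance(item, bool):
            row["val<num>"] = item
        elif isinstance(item, str):
            row["val<str>"] = item
        else:
            raise TypeError(f"unsupported element type: {type(item).__name__}")
        rows.append(row)
    return rows
```

The array index is kept as an explicit column, so the original order can be restored by sorting on (id_jk, index).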

Objects may be nested within arrays, as shown here:
{ "tags": [ { "text": string, "offset": number } ] }
The resulting relational schema can be created as follows:
Root(tags<arr>: join_key)
Root.tags<arr>(id_jk: join_key, index: int, val: join_key)
Root.tags<arr>.val(id_jk: join_key, text: str, offset: num)

Using the one-to-one flattening optimization, the relational schema becomes:
Root(tags<arr>: join_key)
Root.tags<arr>(id_jk: join_key, index: int,
val.text: str, val.offset: num)

Nested and empty arrays
Relational schemas can be created for nested arrays and empty arrays in the same way as for maps. For the following schema:
{ "tags": [ string, [ number ] ], "urls": [ ] }
the relational schema is:
Root(tags<arr>: join_key, urls<arr>: join_key)
Root.tags<arr>(id_jk: join_key, index: int,
val<str>: str, val<arr>: join_key)
Root.tags<arr>.val<arr>(id_jk: join_key, index: int,
val<num>: num)
Root.urls<arr>(id_jk: join_key, index: int)

Note that, for the nested array, a separate table is created with "val" appended to the table name. For the empty array, a separate table is created with only an "index" column and no "val" column; the "val" column can be added later, once the contents of the array have been observed and typed.

Type inference on atomic values
The type inference and relational schema conversion procedures described above rely on the basic types available in JSON. The same procedures apply equally to whatever type system is selected. In other words, the analysis platform can infer narrower scalar types such as integer, float, and time, as long as the atomic scalar type can be inferred from the value. BSON and XML have such extended type systems. Moreover, various heuristics (such as regular expressions) can be used to detect more complex types such as dates and times.

Because ANSI SQL does not support the same types as JSON, the inferred types are converted into the most specific types seen so far for the relational view. For example, if only integers have been seen for the field "freq", the number type is converted to integer in the relational schema for "freq". Similarly, if both integers and floats have been observed, the relational schema shows the "freq" column as a float. Likewise, string fields are converted to character varying types in the relational schema. In other words, types more specific than the basic JSON types may be tracked.

An alternative is to rely on type polymorphism and use the more specific type system to infer the types of the data values. That is, instead of using JSON primitive types, ANSI SQL primitive types are used.

The following is a list of the types tracked during ingestion (left) and how they are converted for the SQL schema (right). Most SQL databases support additional types, including text, which can be used if the client so desires. Note that the ObjectId type is specific to BSON.
int32 -> INTEGER
int64 -> INTEGER
double -> DOUBLE PRECISION
string -> VARCHAR
date -> DATE
bool -> BOOLEAN
object_id (BSON) -> VARCHAR(24)
time -> TIME
timestamp -> TIMESTAMP
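
The mapping above, combined with the rule that integers and floats observed together yield a float column, can be sketched in Python as follows. The names `SQL_TYPE` and `sql_column_type` are illustrative assumptions; fields that mix unrelated types are left to the type-polymorphism handling described earlier.

```python
# Tracked ingestion type -> SQL type, copied from the list above.
SQL_TYPE = {
    "int32": "INTEGER",
    "int64": "INTEGER",
    "double": "DOUBLE PRECISION",
    "string": "VARCHAR",
    "date": "DATE",
    "bool": "BOOLEAN",
    "object_id": "VARCHAR(24)",
    "time": "TIME",
    "timestamp": "TIMESTAMP",
}

NUMERIC = {"int32", "int64", "double"}


def sql_column_type(observed):
    """Pick one SQL type for a field given the set of types seen so far."""
    if not observed:
        raise ValueError("no values observed yet")
    if observed <= NUMERIC:
        # Integers stay INTEGER until a float is seen, then the column widens.
        return SQL_TYPE["double"] if "double" in observed else SQL_TYPE["int32"]
    if len(observed) == 1:
        return SQL_TYPE[next(iter(observed))]
    raise ValueError("polymorphic field: needs one column per type")
```

For example, `sql_column_type({"int32"})` gives INTEGER, while adding "double" to the observed set widens the result to DOUBLE PRECISION.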

Procedure
Converting a JSON schema to a relational schema can be accomplished by recursively unpacking the nested JSON schema structure. A pseudocode representation of an example implementation is shown here.
Call for every attribute in topmost object:
    create_schema(attr_schema, "Root", attr_name)

create_schema(json_schema, rel_name, attr_name):
    /* Creates a table (relation) if it's adorned as an object */
    if json_schema is object:
        Add join key called attr_name to relation rel_name
        new_rel = rel_name + "." + attr_name
        Create relation new_rel
        Add (id_jk: join_key) to new_rel
        /* recursively add attributes to the table (relation) */
        for attr, attr_schema in json_schema:
            create_schema(attr_schema, new_rel, attr)
    /* Creates appropriate attrs and table for (nested) map */
    else if json_schema is map:
        Add join key called attr_name + "<map>" to relation rel_name
        new_rel = rel_name + "." + attr_name + "<map>"
        Create relation new_rel
        Add (id_jk: join_key) and (key: string) to new_rel
        /* recursively add attributes to the table (relation) */
        for each distinct value type val_type in json_schema:
            create_schema(val_type, new_rel, "val")
    /* Creates appropriate attrs and table for array */
    else if json_schema is array:
        Add join key called attr_name + "<arr>" to relation rel_name
        new_rel = rel_name + "." + attr_name + "<arr>"
        Create relation new_rel
        Add (id_jk: join_key) and (index: int) to new_rel
        /* recursively add attributes to the table (relation) */
        for each distinct item type item_type in json_schema:
            create_schema(item_type, new_rel, "val")
    /* Primitive type, add column to the table (relation) */
    else:
        if attr_name does not exist in relation rel_name:
            Add column (attr_name, attr_name's type) to relation rel_name
        else:
            Rename attribute attr_name to
                attr_name + "<original attr_name's type>" in relation rel_name
            Add column (attr_name + "<" + attr_name's type + ">",
                attr_name's type) to relation rel_name
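
The pseudocode can be made concrete with a small Python sketch. The schema encoding is an assumption made only for this illustration: an object is a dict of attribute schemas, a map or array is a tuple `("map", [...])` or `("array", [...])` listing its distinct member schemas, and a primitive is a type-name string. Relations are plain dicts from column name to type. Unlike the pseudocode, the helper also covers a third primitive type arriving after a rename.

```python
def add_primitive(columns, attr_name, type_name):
    """Add a scalar column; switch to type-suffixed names on a type clash."""
    typed = [c for c in columns
             if c.startswith(attr_name + "<") and columns[c] != "join_key"]
    if attr_name not in columns and not typed:
        columns[attr_name] = type_name
        return
    if attr_name in columns:
        # Rename the existing column to carry its original type.
        original = columns.pop(attr_name)
        columns[f"{attr_name}<{original}>"] = original
    columns[f"{attr_name}<{type_name}>"] = type_name


def create_schema(schema, rel_name, attr_name, relations):
    if isinstance(schema, dict):                      # object
        relations[rel_name][attr_name] = "join_key"
        new_rel = f"{rel_name}.{attr_name}"
        relations[new_rel] = {"id_jk": "join_key"}
        for attr, attr_schema in schema.items():
            create_schema(attr_schema, new_rel, attr, relations)
    elif isinstance(schema, tuple):                   # map or array
        kind, member_schemas = schema
        tag = "<map>" if kind == "map" else "<arr>"
        relations[rel_name][attr_name + tag] = "join_key"
        new_rel = f"{rel_name}.{attr_name}{tag}"
        relations[new_rel] = {"id_jk": "join_key"}
        if kind == "map":
            relations[new_rel]["key"] = "string"
        else:
            relations[new_rel]["index"] = "int"
        for member in member_schemas:
            create_schema(member, new_rel, "val", relations)
    else:                                             # primitive
        add_primitive(relations[rel_name], attr_name, schema)


def build_relations(top_schema):
    """Call create_schema for every attribute of the topmost object."""
    relations = {"Root": {}}
    for attr, attr_schema in top_schema.items():
        create_schema(attr_schema, "Root", attr, relations)
    return relations
```

As in the pseudocode, a single-typed member is stored in a plain "val" column and only gains a type suffix once a second type appears.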

The above procedure creates the relational schema without the one-to-one optimization. A second pass may be made over the relational schema to identify object tables with one-to-one relationships and collapse them. Alternatively, the one-to-one optimization can be performed inline, but this is not shown for clarity. When a subtree of the schema with nested objects is not "interrupted" by arrays or maps, the entire object subtree can be collapsed into a single table with attributes named by their path to the root of the subtree. An attribute that is a map or an object remains in a separate table, but any sub-objects contained within it can be collapsed recursively. These principles apply to any arbitrary depth of nested objects.
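
A possible shape for the second pass is sketched below in Python. It assumes, purely for illustration, that relations are held as a dict from relation name to a dict of columns, that relation names are dot-separated paths, and that map and array tables end in "<map>" or "<arr>". The function name `collapse_one_to_one` is hypothetical.

```python
def collapse_one_to_one(relations):
    """Fold every object table into its parent, deepest tables first."""
    result = {name: dict(cols) for name, cols in relations.items()}
    # Deepest first, so sub-objects are folded before their parents.
    for name in sorted(result, key=lambda n: n.count("."), reverse=True):
        parent, _, attr = name.rpartition(".")
        if not parent or attr.endswith(">"):
            continue  # Root itself, or a map/array table: keep separate
        child = result.pop(name)
        folded = {}
        for col, col_type in result[parent].items():
            if col == attr:
                # Replace the join key with the child's path-named columns.
                for child_col, child_type in child.items():
                    if child_col != "id_jk":
                        folded[f"{attr}.{child_col}"] = child_type
            else:
                folded[col] = col_type
        result[parent] = folded
    return result
```

Applied to the three-table schema for objects nested in an array, this yields the two-table form with val.text and val.offset columns shown earlier.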

Populating indexes with data
Once the JSON and relational schemas have been updated in response to a new object, the data contained within the object can be stored in indexes, as described below.

The indexes in the analysis platform rely on order-preserving indexes that store key-value pairs. The indexes support the operations lookup(prefix), insert(key, value), delete(key), update(key, value), and get_next() for range searches. Numerous data structures and low-level libraries support such an interface. Examples include BerkeleyDB, TokyoCabinet, KyotoCabinet, and LevelDB. These internally use order-preserving secondary-storage data structures such as B-trees, LSM (log-structured merge) trees, and Fractal trees. There may be special cases, such as for object IDs, in which non-order-preserving indexes (such as hash tables) are used. With a non-order-preserving index, get_next() and the ability to perform range searches may be sacrificed.
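
To make the interface concrete, here is a minimal in-memory Python sketch of an order-preserving index, built on a sorted key list and the standard `bisect` module. It is a stand-in for illustration only: the class name `OrderedIndex` is an assumption, keys are assumed to be strings, and a real deployment would use a store such as LevelDB.

```python
import bisect


class OrderedIndex:
    """Toy order-preserving key-value index (string keys, kept sorted)."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def insert(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def update(self, key, value):
        if key not in self._values:
            raise KeyError(key)
        self._values[key] = value

    def delete(self, key):
        del self._values[key]
        self._keys.pop(bisect.bisect_left(self._keys, key))

    def lookup(self, prefix):
        """Return all (key, value) pairs whose key starts with prefix."""
        i = bisect.bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append((self._keys[i], self._values[self._keys[i]]))
            i += 1
        return out

    def get_next(self, key):
        """Return the pair that follows key in sort order, or None."""
        i = bisect.bisect_right(self._keys, key)
        if i == len(self._keys):
            return None
        return self._keys[i], self._values[self._keys[i]]
```

Because keys sharing a prefix are adjacent in sort order, lookup(prefix) only touches a contiguous run of entries, which is the property the indexes below depend on.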

In various implementations, the analysis framework uses LevelDB. LevelDB implements LSM trees, performs compression, and provides good performance for data sets with high insert rates. LevelDB also makes performance trade-offs that may be consistent with common usage models for the analysis framework. For example, when analyzing data such as log data, data is added frequently, but existing data is changed rarely or not at all. Advantageously, LevelDB is optimized for fast data insertion at the expense of slower data deletion and data modification.

Order-preserving indexes have the property that they collocate key-value pairs in key order. As a result, when searching for key-value pairs near a certain key, or when retrieving items in order, responses are returned much faster than when retrieving items out of order.

The analysis platform can maintain multiple key-value indexes for each source collection and, in some implementations, between two and six indexes for each source collection. The analysis platform uses these indexes to evaluate SQL queries over the relational schema (the SQL schema does not need to be materialized). Each object is assigned a unique id denoted by tid. The BigIndex (BI) and the ArrayIndex (AI) are the two indexes from which the other indexes and the schemas can be reconstructed.

BigIndex (BI)
The BigIndex (BI) is the base data store that stores all fields in the data that are not embedded in an array. A value (val) can be retrieved from the BI by a key based on col_path and tid.
(col_path, tid) -> val

col_path is the path to the field from the root object, with the field's type appended. For example, for the following records:
1: { "text": "Tweet this", "user": { "id": 29471497,
"screen_name": "Mashah08" } }
2: { "text": "Tweet that", "user": { "id": 27438992,
"screen_name": "binkert" } }
the following key-value pairs are added to the BI:
(root.text<str>, 1) --> "Tweet this"
(root.text<str>, 2) --> "Tweet that"
(root.user.id<num>, 1) --> 29471497
(root.user.id<num>, 2) --> 27438992
(root.user.screen_name<str>, 1) --> "Mashah08"
(root.user.screen_name<str>, 2) --> "binkert"

In various implementations, the underlying index store (such as LevelDB) is unaware of the significance of the segments of the key. In other words, while "root.text<str>, 1" signifies the first element of the string text field in the root table, the index store simply sees an undifferentiated multi-character key. As a simple example, the key can be created by concatenating col_path and tid (importantly, in that order). For example, the first key shown above may be passed to the index store as "root.text<str>1". The index store collocates the second key ("root.text<str>2") with the first key not because it understands that the paths are similar, but simply because the first 14 characters are the same. Although the column name and type are stored as part of every key, the sort ordering means that compression (such as prefix-based compression) can be used to reduce the storage cost.
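
The following Python sketch shows one way the BI entries for a record might be generated and then flattened into a single string key. The helper names `type_tag`, `bi_entries`, and `encode_key` are assumptions for illustration. Zero-padding the tid is also an assumption that goes beyond the plain concatenation described above: it keeps multi-digit tids in numeric order under string comparison.

```python
def type_tag(value):
    """Map a Python scalar to the type suffix used in col_path."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "num"
    if isinstance(value, str):
        return "str"
    raise TypeError(f"unsupported scalar: {type(value).__name__}")


def bi_entries(tid, obj, path="root"):
    """Yield ((col_path, tid), val) for every scalar field outside arrays."""
    for name, value in obj.items():
        field = f"{path}.{name}"
        if isinstance(value, dict):
            yield from bi_entries(tid, value, field)   # nested object
        elif isinstance(value, list):
            continue                                   # arrays go to the AI
        else:
            yield (f"{field}<{type_tag(value)}>", tid), value


def encode_key(col_path, tid, width=10):
    """Concatenate col_path and tid, in that order, into one flat key."""
    return f"{col_path}{tid:0{width}d}"
```

Sorting the encoded keys groups all values of one column together, ordered by tid within the column.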

In the BI, all columns of the source data are combined into a single structure, unlike traditional column stores, which create a separate column file for every new column. The BI approach allows for a single index instance and also enables map detection to be delayed. Because new fields simply appear as entries in the BI, failing to pivot a map does not incur the physical cost of creating a large number of physical files for each field that is later turned into a map.

In the BI, the key-value pairs for each attribute or "column" are collocated. Thus, like column files, the BI allows the query executor to focus on the fields of interest in a query rather than forcing it to sweep through data containing fields not referenced in the query.

ArrayIndex (AI)
Although fields from the normalized tables for arrays could be added to the BI, the array indices would then be separated from their corresponding values. Instead, array fields can be added to a separate ArrayIndex (AI) that preserves the index information and allows entries in the same array to be collocated by the index store, which provides good performance for many queries. The array values can be stored in the AI using the following signature:
(col_path, tid, join_key, index) -> val

col_path is the path of the array field: for example, "root.tags" for elements of the tags array, or "root.tags.text" for the "text" field of an object inside the tags array. join_key and index are the array's foreign key and the index of the value. The tid is also stored, to avoid having to store a separate entry in the BI for each array. The tid can be used to look up values for corresponding columns in the same object. Consider objects that represent hashtags in different tweets:
1: { "id": 3465345, "tags": [ "muffins", "cupcakes" ] }
2: { "id": 3465376, "tags": [ "curry", "sauces" ] }
For these, the tags table has the following schema:
Root.tags<arr>(id_jk: join_key, index: int, val: string)
For this table, the entries in the AI are:
(root.tags<arr>, 1, 1, 0) --> "muffins"
(root.tags<arr>, 1, 1, 1) --> "cupcakes"
(root.tags<arr>, 2, 2, 0) --> "curry"
(root.tags<arr>, 2, 2, 1) --> "sauces"

The array index allows the values of array fields to be iterated over quickly. This is useful, for example, when running statistics over these fields (such as sum, average, or variance) or when finding specific values.
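
As a rough Python sketch of the AI for top-level arrays, the function below emits one entry per array element. Using the tid as the join key for a top-level array is an assumption taken from the example entries above; `ai_entries` and `array_values` are hypothetical names.

```python
def ai_entries(tid, obj):
    """Yield ((col_path, tid, join_key, index), val) for top-level arrays."""
    for name, value in obj.items():
        if isinstance(value, list):
            col_path = f"root.{name}<arr>"
            for index, item in enumerate(value):
                # For a top-level array the join key equals the tid here.
                yield (col_path, tid, tid, index), item


def array_values(ai, col_path, tid):
    """Iterate one object's array in index order via the sorted keys."""
    return [val for key, val in sorted(ai.items())
            if key[0] == col_path and key[1] == tid]
```

Because the key ends with the array index, a scan in key order returns the elements of each array in their original positions.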

Nested array example
Note that for arrays in the root object (top-level arrays), the tid and join_key fields are redundant (see above) and can be optimized away. For nested arrays, however, a separate join_key is needed and is not superfluous. For example, consider this JSON object:
1: { "id": 3598574, "tags": [[8, 25, 75], ["muffins", "donuts", "pastries"]] }
The corresponding relational schema is:
Root.tags<arr>(id_jk: join_key, index: int, val<arr>: join_key)
Root.tags<arr>.val<arr>(id_jk: join_key, index: int, val<num>: num, val<str>: str)
Recall that the AI uses the following key-value pair:
col_path, tid, join_key, index -> val
This results in the following AI entries:
tags<arr>.val<arr>, 1, 1, 0 -> 1
tags<arr>.val<arr>, 1, 1, 1 -> 2
(numbers array)
tags<arr>.val<arr>.val<num>, 1, 1, 0 -> 8
tags<arr>.val<arr>.val<num>, 1, 1, 1 -> 25
tags<arr>.val<arr>.val<num>, 1, 1, 2 -> 75
(string array)
tags<arr>.val<arr>.val<str>, 1, 2, 0 -> "muffins"
tags<arr>.val<arr>.val<str>, 1, 2, 1 -> "donuts"
tags<arr>.val<arr>.val<str>, 1, 2, 2 -> "pastries"

Note that if the join key were removed from the nested array key-value pairs, it would no longer be possible to tell whether "muffins" was part of the first nested array or the second. Thus, the join key is redundant for a top-level array but not for nested arrays.
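
The entries in this example can be reproduced with a short Python sketch. The numbering of join keys (one per nested array, counted from 1 within the object) is an assumption chosen to match the listed entries, and `nested_ai_entries` is a hypothetical name; the sketch handles exactly one level of nesting with numeric or string elements.

```python
from itertools import count


def nested_ai_entries(tid, name, outer):
    """Build AI entries for an array of arrays, one join key per inner array."""
    next_join_key = count(1)
    entries = {}
    outer_path = f"{name}<arr>.val<arr>"
    for index, inner in enumerate(outer):
        join_key = next(next_join_key)
        # Outer entry: position in the top-level array -> inner array's key.
        entries[(outer_path, tid, tid, index)] = join_key
        for inner_index, item in enumerate(inner):
            tag = "str" if isinstance(item, str) else "num"
            entries[(f"{outer_path}.val<{tag}>", tid, join_key, inner_index)] = item
    return entries
```

Dropping the join_key component from the inner keys would make the entries of different inner arrays collide whenever they share a type, which is the ambiguity described above.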

ArrayIndex2 (AI2)
Although the two indexes above (BI and AI) are sufficient to reconstruct all the ingested data, there are access patterns that they do not support efficiently. For those access patterns, the following indexes are introduced; they can optionally be created to improve performance at the cost of additional space.

AI2 has the signature
(col_path, index, tid, join_key) -> val
which allows specific index elements of an array to be found quickly. For example, returning all tags at index 10 (tags[10]) is simple and fast using AI2.
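
Since AI2 holds the same data as the AI under a reordered key, it can be sketched in Python as a simple re-keying. The dict-based index and the names `to_ai2` and `elements_at` are assumptions for illustration; an ordered store would answer the query with a prefix scan on (col_path, index) instead of filtering.

```python
def to_ai2(ai):
    """Re-key AI entries from (col_path, tid, join_key, index) to AI2 order."""
    return {(col_path, index, tid, join_key): val
            for (col_path, tid, join_key, index), val in ai.items()}


def elements_at(ai2, col_path, index):
    """Return the element at one array position across all objects."""
    return [val for key, val in sorted(ai2.items())
            if key[:2] == (col_path, index)]
```

With the hashtag entries from the AI example, asking for position 1 of root.tags<arr> returns the second tag of each tweet.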

MapIndex (MI)
The map index is similar to the array index in its functionality and signature:
(col_path, tid, join_key, map_key) -> val

The main difference is that the map index is not built during initial ingestion but is instead constructed asynchronously. During initial loading, maps are treated as objects and inserted into the BI as usual. Once both are populated, entries are available in both the BI and the MI for more efficient query processing. The BI entries remain relevant in case a user or administrator requests that the map be unadorned. Only the relational schema needs to be changed, and the original BI entries corresponding to the unmapped data are then used in queries.

Like the AI, the MI is useful when iterating through the elements of a map: for applying statistical functions, for restricting to specific field names, and so on. Consider again objects that maintain page view statistics:
1: { "url": "noudata.com/blog",
"page_views": { "2012-12-01": 10, "2012-12-02": 12, ...
"2012-12-15": 10 } }
2: { "url": "noudata.com/jobs",
"page_views": { "2012-12-01": 2, "2012-12-02": 4, ...
"2012-12-15": 7 } }
If flagged as a map, the relational schema for the page_views table is
Root.page_views<map>(id_jk: join_key, key: string, val: num)
where key is the map's key and val is the associated value. For the above objects, the entries in the MI would be:
(root.page_views<map>, 1, 1, "2012-12-01") --> 10
(root.page_views<map>, 1, 1, "2012-12-02") --> 12
...
(root.page_views<map>, 1, 1, "2012-12-15") --> 10
(root.page_views<map>, 2, 2, "2012-12-01") --> 2
(root.page_views<map>, 2, 2, "2012-12-02") --> 4
...
(root.page_views<map>, 2, 2, "2012-12-15") --> 7
This ordering allows the values in the page_views map to be collocated for each page, whereas in the BI the values would be collocated by date.
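
A minimal Python sketch of populating the MI and running a per-page statistic over it follows. The names `mi_entries` and `total_per_object` are hypothetical, the join key is assumed equal to the tid for a top-level map as in the entries above, and any sample data passed in would need to spell out the dates elided by "..." in the example.

```python
from collections import defaultdict


def mi_entries(tid, obj, map_fields):
    """Yield ((col_path, tid, join_key, map_key), val) for flagged maps."""
    for name in map_fields:
        col_path = f"root.{name}<map>"
        for map_key, value in obj.get(name, {}).items():
            yield (col_path, tid, tid, map_key), value


def total_per_object(mi, col_path):
    """Sum the map values of each object, e.g. total views per page."""
    totals = defaultdict(int)
    for (path, tid, _join_key, _map_key), value in mi.items():
        if path == col_path:
            totals[tid] += value
    return dict(totals)
```

In an ordered store, all entries of one page's map are adjacent, so each total comes from a single contiguous scan.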

MapIndex2 (MI2)
In addition, an auxiliary map index may be implemented. It is similar to the array index in its functionality and signature:
(col_path, map_key, tid, join_key) -> val
This allows efficient searches for specific map elements, such as "all the different values corresponding to map key 2012-12-05". A generic representation of both AI2 and MI2 can be written as follows:
(col_path, key, tid, join_key) -> val
where key corresponds to the index of an array or the map_key of a map.

ValueIndex (VI)
Although the above indexes are useful for looking up values for specific fields and iterating through those values, they do not allow fast access when a query is looking only for specific values or ranges of values. For example, a query may ask to return the text of tweets written by "mashah08". To assist such queries, a ValueIndex can be built for some or all of the fields in the schema. The ValueIndex may be built as data is ingested or be built asynchronously later. The key for the value index is
(col_path, val)
where val is the value of the attribute in the source data. The value corresponding to that key in the VI depends on where the field for the value occurs. For each of the indexes above, it varies as follows:
BI: (col_path, val) -> tid
AI: (col_path, val) -> tid, join_key, index
MI: (col_path, val) -> tid, join_key, key

For example, the tweets
1: { "text": "Tweet this", "user": { "id": 29471497,
"screen_name": "mashah08" } }
2: { "text": "Tweet that", "user": { "id": 27438992,
"screen_name": "binkert" } }
are stored as
(root.text<string>, "Tweet this") --> 1
(root.text<string>, "Tweet that") --> 2
(root.user.id<num>, 29471497) --> 1
(root.user.id<num>, 27438992) --> 2
(root.user.screen_name<string>, "mashah08") --> 1
(root.user.screen_name<string>, "binkert") --> 2
Using the VI, one can search for all tweets written by "mashah08" by looking for the key (root.user.screen_name, "mashah08") and retrieving all associated tids. The BI can then be searched using the retrieved tids to return the corresponding text of each tweet. The cost of indexes, and especially of the value index, is the additional storage space and the execution time needed to update them as new objects are added to the system. Because of space or update overheads, the user may not want to index all possible paths. The user can therefore specify which paths to index in the VI.
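
The two-step lookup described above (VI to find tids, then BI to fetch the text) can be sketched in Python as follows. The dict-based indexes and the function names `build_vi` and `texts_by_author` are assumptions for illustration; only BI-style entries are covered, so each VI value is a list of tids.

```python
from collections import defaultdict


def build_vi(bi):
    """Invert BI entries: (col_path, val) -> list of tids."""
    vi = defaultdict(list)
    for (col_path, tid), val in bi.items():
        vi[(col_path, val)].append(tid)
    return dict(vi)


def texts_by_author(bi, vi, screen_name):
    """Find tids through the VI, then read each tweet's text from the BI."""
    tids = vi.get(("root.user.screen_name<string>", screen_name), [])
    return [bi[("root.text<string>", tid)] for tid in tids]


# BI entries for the two example tweets.
bi = {
    ("root.text<string>", 1): "Tweet this",
    ("root.text<string>", 2): "Tweet that",
    ("root.user.id<num>", 1): 29471497,
    ("root.user.id<num>", 2): 27438992,
    ("root.user.screen_name<string>", 1): "mashah08",
    ("root.user.screen_name<string>", 2): "binkert",
}
vi = build_vi(bi)
```

Restricting `build_vi` to a chosen set of col_path values would model the option of indexing only some paths to save space and update time.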

RowIndex (RI)
To facilitate re-creation of an entire ingested object (analogous to requesting a record in a traditional row-based store), a RowIndex (RI) can be implemented. The RowIndex stores a key-value pair:
tid --> JSON object

The JSON object may be stored as a string representation, as BSON, or in any other serialized format, such as a tree structure used for the internal representation of a JSON object. For the two tweets discussed above with respect to the VI, the corresponding RI entries are:
1 --> { "text": "Tweet this", "user": { "id": 29471497,
"screen_name": "mashah08" } }
2 --> { "text": "Tweet that", "user": { "id": 27438992,
"screen_name": "binkert" } }

Example
An example covering the BI, AI, MI, and VI. Consider tweets similar to those above, with an added "retweet_freq" attribute that tracks how many times a tweet was retweeted on each day.
1: { "text": "Love #muffins and #cupcakes: bit.ly/955Ffo",
"user": { "id": 29471497, "screen_name": "mashah08" },
"tags": [ "muffins", "cupcakes" ],
"retweet_freq": { "2012-12-01": 10, "2012-12-02": 13,
"2012-12-03": 1 } }
2: { "text": "Love #sushi and #umami: bit.ly/955Ffo",
"user": { "id": 28492838, "screen_name": "binkert" },
"tags": [ "sushi", "umami" ],
"retweet_freq": { "2012-12-04": 20, "2012-12-05": 1 } }

The schema for these records is
O{ "text": string, "user": O{ "id": number,
"screen_name": string }, "tags": [ string ],
"retweet_freq": M{ "2012-12-01": number ... "2012-12-05":
number } }

The JSON schema for these records is
{
  "type": "object",
  "obj_type": "object",
  "properties": {
    "text": {
      "type": "string"
    },
    "user": {
      "type": "object",
      "obj_type": "object",
      "properties": {
        "id": {
          "type": "number"
        },
        "screen_name": {
          "type": "string"
        }
      }
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "retweet_freq": {
      "type": "object",
      "obj_type": "map",
      "properties": {
        "2012-12-01": {
          "type": "number"
        },
        ...
        "2012-12-05": {
          "type": "number"
        }
      }
    }
  }
}

If retweet_freq is not treated as a map, the relational schema is
Root (text: str,
user.id: num, user.screen_name: str,
tags<arr>: join_key,
retweet_freq.2012-12-01: num,
retweet_freq.2012-12-02: num,
retweet_freq.2012-12-03: num,
retweet_freq.2012-12-04: num,
retweet_freq.2012-12-05: num)
Root.tags<arr> (id_jk: join_key,
index: int,
val: str)

In this case, the example records above populate these relations as follows.
Root:
("Love #muffins ...", 29471497, mashah08, 1, 10, 13, 1, NULL,
NULL)
("Love #sushi ...", 28492838, binkert, 2, NULL, NULL, NULL,
20, 1)
Root.tags<arr>:
(1, 0, "muffins")
(1, 1, "cupcakes")
(2, 0, "sushi")
(2, 1, "umami")

Note that these are the tuples that a query would return if a "select *" query were run on these tables. The tuples are not necessarily materialized as such in the storage engine. That is, they may be merely a virtual view over the underlying data and need not be physically stored as shown.

If retweet_freq is identified as a map, the relational schema becomes more concise (and more accommodating of additional data), as follows.
Root (text: str,
user.id: num, user.screen_name: str,
tags<arr>: join_key,
retweet_freq<map>: join_key)
Root.tags<arr> (id_jk: join_key,
index: int,
val: str)
Root.retweet_freq<map> (id_jk: join_key,
key: str,
val: num)

The corresponding tuples are
Root:
("Love #muffins ...", 29471497, mashah08, 1, 1)
("Love #sushi ...", 28492838, binkert, 2, 2)
Root.tags<arr>:
(1, 0, "muffins")
(1, 1, "cupcakes")
(2, 0, "sushi")
(2, 1, "umami")
Root.retweet_freq<map>:
(1, "2012-12-01", 10)
(1, "2012-12-02", 13)
(1, "2012-12-03", 1)
(2, "2012-12-04", 20)
(2, "2012-12-05", 1)
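As a rough illustration of how one tweet maps onto the three relations when retweet_freq is treated as a map, consider the sketch below. The function name `flatten` is hypothetical, and the join key is simply set equal to the tid, which matches this example only because the arrays and maps are not nested.

```python
def flatten(tid, obj):
    """Return (root_row, tags_rows, map_rows) for one tweet object.

    Assumption: join keys equal the tid, as in the example tuples.
    """
    root_row = (
        obj["text"],
        obj["user"]["id"],
        obj["user"]["screen_name"],
        tid,  # join key into Root.tags<arr>
        tid,  # join key into Root.retweet_freq<map>
    )
    # One row per array element: (join key, position, value).
    tags_rows = [(tid, index, tag) for index, tag in enumerate(obj["tags"])]
    # One row per map entry: (join key, map key, value).
    map_rows = [(tid, key, val) for key, val in obj["retweet_freq"].items()]
    return root_row, tags_rows, map_rows


tweet = {
    "text": "Love #muffins and #cupcakes: bit.ly/955Ffo",
    "user": {"id": 29471497, "screen_name": "mashah08"},
    "tags": ["muffins", "cupcakes"],
    "retweet_freq": {"2012-12-01": 10, "2012-12-02": 13, "2012-12-03": 1},
}
root_row, tags_rows, map_rows = flatten(1, tweet)
```

A new date key in retweet_freq adds a row to Root.retweet_freq<map> rather than a column to Root, which is why the map form accommodates additional data more easily.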

The key-value pairs added to the BI are
(root.retweet_freq.2012-12-01, 1) --> 10
(root.retweet_freq.2012-12-02, 1) --> 13
(root.retweet_freq.2012-12-03, 1) --> 1
(root.retweet_freq.2012-12-04, 2) --> 20
(root.retweet_freq.2012-12-05, 2) --> 1
(root.text, 1) --> "Love #muffins and #cupcakes"
(root.text, 2) --> "Love #sushi and #umami"
(root.user.id, 1) --> 29471497
(root.user.id, 2) --> 28492838
(root.user.screen_name, 1) --> mashah08
(root.user.screen_name, 2) --> binkert

The key-value pairs added to the AI are as follows. Note that in this case the join key is redundant (it is the same as the tid) because there are no nested arrays.
(root.tags<arr>, 1, 1, 0) --> "muffins"
(root.tags<arr>, 1, 1, 1) --> "cupcakes"
(root.tags<arr>, 2, 2, 0) --> "sushi"
(root.tags<arr>, 2, 2, 1) --> "umami"

The RI will have the following two entries.
1 --> { "text": "Love #muffins and #cupcakes: bit.ly/955Ffo",
"user": { "id": 29471497, "screen_name": "mashah08" },
"tags": [ "muffins", "cupcakes" ],
"retweet_freq": { "2012-12-01": 10, "2012-12-02": 13, "2012-12-03": 1 } }
2 --> { "text": "Love #sushi and #umami: bit.ly/955Ffo",
"user": { "id": 28492838, "screen_name": "binkert" },
"tags": [ "sushi", "umami" ],
"retweet_freq": { "2012-12-04": 20, "2012-12-05": 1 } }

If and when the MI is built, it will have the following entries.
(root.retweet_freq<map>, 1, 1, "2012-12-01") --> 10
(root.retweet_freq<map>, 1, 1, "2012-12-02") --> 13
(root.retweet_freq<map>, 1, 1, "2012-12-03") --> 1
(root.retweet_freq<map>, 2, 2, "2012-12-04") --> 20
(root.retweet_freq<map>, 2, 2, "2012-12-05") --> 1

Similarly, the VI will have the following entries (if all paths are indexed and maps are treated as maps).
(root.retweet_freq<map>, 1) --> 2, 2, "2012-12-05"
(root.retweet_freq<map>, 1) --> 1, 1, "2012-12-03"
(root.retweet_freq<map>, 10) --> 1, 1, "2012-12-01"
(root.retweet_freq<map>, 13) --> 1, 1, "2012-12-02"
(root.retweet_freq<map>, 20) --> 2, 2, "2012-12-04"
(root.tags<arr>, "cupcakes") --> 1, 1, 1
(root.tags<arr>, "muffins") --> 1, 1, 0
(root.tags<arr>, "sushi") --> 2, 2, 0
(root.tags<arr>, "umami") --> 2, 2, 1
(root.text<str>, "Love #muffins and #cupcakes") --> 1
(root.text<str>, "Love #sushi and #umami") --> 2
(root.user.id, 29471497) --> 1
(root.user.id, 28492838) --> 2
(root.user.screen_name, "mashah08") --> 1
(root.user.screen_name, "binkert") --> 2

Although the actions above are described in stages, they can be pipelined to allow ingestion in a single pass that loads the BI, AI, and RI and computes the JSON schema. The other indexes can be built asynchronously and can be enabled or disabled as desired.
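A single pass that fills the BI, AI, and RI together might look like the following sketch. It is illustrative only: the function name `ingest` and the dict-based indexes are assumptions, arrays are assumed to hold scalars only, the join key is set equal to the tid (valid because nothing is nested), and maps are left as ordinary objects on the assumption that the MI is built later.

```python
import json


def ingest(objects):
    """One pass over (tid, object) pairs that fills BI, AI, and RI together."""
    bi, ai, ri = {}, {}, {}

    def walk(path, value, tid):
        if isinstance(value, dict):
            # Nested object: extend the dotted path for each attribute.
            for key, child in value.items():
                walk(f"{path}.{key}", child, tid)
        elif isinstance(value, list):
            # Array of scalars: key is (path<arr>, tid, join key, index).
            for index, item in enumerate(value):
                ai[(f"{path}<arr>", tid, tid, index)] = item
        else:
            # Scalar leaf: key is (path, tid).
            bi[(path, tid)] = value

    for tid, obj in objects:
        ri[tid] = json.dumps(obj)  # RI keeps the whole object, here as a string
        walk("root", obj, tid)
    return bi, ai, ri


tweet = {
    "text": "Love #muffins and #cupcakes: bit.ly/955Ffo",
    "user": {"id": 29471497, "screen_name": "mashah08"},
    "tags": ["muffins", "cupcakes"],
    "retweet_freq": {"2012-12-01": 10, "2012-12-02": 13, "2012-12-03": 1},
}
bi, ai, ri = ingest([(1, tweet)])
```

Because each object is visited once and every index is updated during that visit, no second scan of the input is needed for the three primary indexes.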

System Architecture
The analysis platform is designed to be service-oriented. In various implementations, there are five main services: a proxy, a metadata service, a query executor, a storage service, and an ingestion service.

This decoupled approach may have several advantages. Because these services communicate only through external APIs (remote procedure calls), the services can be multiplexed and each can be shared independently. For example, multiple proxies may be used per executor, and multiple executors per storage service. The metadata service can also be shared across multiple instances of the executor and storage services.

The executor, storage, and ingestion services are parallelized and can run their individual pieces in virtual machine instances in either private or public infrastructure. This allows these services to be suspended and scaled independently, which is useful for reducing costs by adjusting service capacity to fluctuations in demand. For example, the elasticity of a public cloud can be used to highly parallelize the ingestion service for fast overnight loads, while keeping the execution and storage services reduced in size for daily query workloads.

The proxy is the gateway to clients and supports one or more standard protocols, such as ODBC (Open Database Connectivity), libpq, JDBC (Java Database Connectivity), and SSL (Secure Sockets Layer). The gateway serves as a firewall, an authentication service, and a locus of control for the internal services. For example, client connections (such as network sockets) can be kept open at the proxy while the supporting execution and storage services are suspended to save costs. When a client connection becomes active again, the needed services can be woken on demand with a relatively short start-up latency.

The metadata service is typically shared by many instances of the other services. It stores metadata including schemas, source information, partitioning information, client usernames, keys, statistics (histograms, value distributions, etc.), and information about the current state of each service (number of instances, IP addresses, etc.).

The storage service manages the indexes and serves read and write requests. In addition, the query executor can push a number of functions down into the storage service. In various implementations, the storage service can evaluate predicates and UDFs (user-defined functions) to filter results, evaluate local joins (e.g., to reconstruct objects), evaluate pushed-down joins (e.g., broadcast joins), and evaluate local aggregations.

The storage service can be parallelized using a technique called partitioned parallelism. In this approach, numerous instances, or partitions, of the storage service are created, and the ingested objects are divided among the partitions. Each partition stores each type of index just as if it were a single whole instance, but indexes only a subset of the ingested data.

The analysis engine supports one or more partitioning strategies. A simple but effective strategy is to partition the objects by tid and store each object's entries in the local indexes. In this way, ingested objects are not split across different instances; such splitting could consume significant network bandwidth when a query relies on multiple portions of an object. The tid can be assigned in a number of ways, including hash assignment, round robin, or range-based assignment. These particular assignments spread the load by distributing the most recent data across all of the instances.
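The three tid assignment schemes named above could be sketched roughly as follows. The function names are hypothetical, tids are assumed to be sequential integers, and the cycling behavior of `range_partition` is one possible reading of "range-based assignment" rather than something the text specifies.

```python
import zlib


def hash_partition(tid, num_partitions):
    """Stable hash of the tid; crc32 gives the same result on every run."""
    return zlib.crc32(str(tid).encode("utf-8")) % num_partitions


def round_robin_partition(tid, num_partitions):
    """With sequentially generated tids, modulo yields round-robin placement."""
    return tid % num_partitions


def range_partition(tid, range_size, num_partitions):
    """Contiguous blocks of range_size tids, cycling through the partitions."""
    return (tid // range_size) % num_partitions


# Consecutive tids land on different partitions under round robin.
print([round_robin_partition(tid, 4) for tid in range(8)])
```

Whichever scheme is used, all index entries for a given tid live on one partition, so an object never has to be reassembled across the network.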

Another strategy is to partition by another field value (or combination of field values), such as a user id or session id. Partitioning on an alternate field (column) makes it convenient to perform local joins with other tables or collections, for example user profiles. The partitioning strategy may be hash partitioning, or it may use sampling and range partitioning. The former is used for efficient point lookups, and the latter for supporting efficient range searches.
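Sampling followed by range partitioning can be sketched as below. This is an assumption-laden illustration: the names `range_boundaries` and `partition_for` are hypothetical, and the split points are taken at evenly spaced positions in the sorted sample, which is only one way to choose them.

```python
import bisect


def range_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sample of field values."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(value, boundaries):
    """Partition index for value; a value equal to a boundary goes to the right."""
    return bisect.bisect_right(boundaries, value)


# Sample of user ids drawn from the data (here simply 0..99).
boundaries = range_boundaries(list(range(100)), 4)
print(boundaries, partition_for(10, boundaries), partition_for(99, boundaries))
```

Because neighboring values fall into the same partition, a range search touches only the partitions whose ranges overlap the query, whereas hash partitioning would scatter those values across all partitions.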

Regardless of the partitioning strategy, an object, or any subset of the object, should be reconstructable locally. Storage service partitions therefore have no cross talk during query processing and need to communicate only with the execution service through its API.

The storage service has a cache. The cache size in each partition can be increased, or the number of partitions can be increased, to improve the performance of the storage service. The storage service can cache the indexes in memory or on local disk, and the indexes can live on external storage such as Amazon S3. This feature allows storage service nodes to be shut down and destroyed, and to be redeployed whenever necessary. It also enables extreme elasticity: the ability to hibernate the storage service to S3 at low cost and to change storage service capacity as demand fluctuates.

The query execution service executes the query plan generated by the query planning phase. It implements query operators such as join, union, sort, and aggregation. Many of these operations can be pushed down to the storage service, and are pushed down when possible. These include predicates, projection, columnar joins to reconstruct the projected attributes, and partial aggregations for distributive and algebraic aggregate functions with GROUP BY statements.
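Partial aggregation for an algebraic function such as AVG can be illustrated with the sketch below: each storage partition reduces its rows to a small per-group (sum, count) state, and the executor merges those states. The function names and the row format (group, value) are assumptions made for the illustration.

```python
from collections import defaultdict


def partial_avg(rows):
    """Runs on one storage partition: per-group (sum, count) for an AVG query."""
    state = defaultdict(lambda: [0, 0])
    for group, value in rows:
        state[group][0] += value
        state[group][1] += 1
    return {group: tuple(sc) for group, sc in state.items()}


def merge_avg(partials):
    """Runs on the executor: combine the partial states, then finalize AVG."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for group, (s, c) in partial.items():
            total[group][0] += s
            total[group][1] += c
    return {group: s / c for group, (s, c) in total.items()}


# Retweet counts grouped by screen name, split across two partitions.
p1 = partial_avg([("mashah08", 10), ("mashah08", 13)])
p2 = partial_avg([("mashah08", 1), ("binkert", 20)])
result = merge_avg([p1, p2])
```

Only one (sum, count) pair per group crosses the network, instead of every matching row, which is the benefit of pushing the aggregation down.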

The query execution service takes in data from the storage service and computes the non-local operations: non-local joins, GROUP BY statements that require repartitioning, sorts, and so on. The executor is similar to a partitioned parallel executor. It uses exchange operators to repartition between query execution steps and employs local storage for spilling intermediate results. For many queries, it is possible to run most of the query in the storage service and to require only a single executor node to collect the results and perform any small non-local operations.

Ingestion Service
The ingestion service is responsible for loading semi-structured data into the storage service, where it can be queried. Users supply data in a variety of formats (e.g., JSON, BSON, CSV) from a variety of platforms (e.g., MongoDB, Amazon S3, HDFS), optionally compressed with a compression mechanism (e.g., GZIP, BZIP2, Snappy). The basic procedure applies regardless of the format used.

The ingestion task can be roughly divided into two parts: the initial ingestion task, which loads a large volume of new user data, and incremental ingestion, which occurs periodically as new data becomes available.

Initial Ingestion
The initial ingestion process can be broken into several steps. First, the input data is partitioned into chunks. Users provide the initial data either as a collection of files or by connecting directly to their data source. The location and format of these files are recorded in the metadata service. Users may supply data that is already partitioned, for instance because of log file rotation; if not, the files can be partitioned into chunks to support parallel loading. These chunks are typically on the order of several hundred megabytes and are processed independently.

The exact mechanism for partitioning the input files depends on the data format. For uncompressed formats in which records are separated by newlines (e.g., JSON or CSV), a single file can be processed in parallel using a number of processes equal to the target number of chunks. Each process starts at the appropriate offset in the file, (file_size / total_num_chunks) * chunk_num, and searches forward until a newline is found. For compressed data, or for data in a binary format such as BSON, each input file may need to be scanned sequentially. The location of each chunk (file, offset, size) is stored in the metadata service.
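The offset computation for newline-delimited files can be sketched as follows. The function names `chunk_start` and `chunk_ranges` are hypothetical, and the detail of backing up one byte before scanning (so that a record beginning exactly at the offset is not skipped) is an implementation choice made here, not something the text prescribes.

```python
import os


def chunk_start(path, chunk_num, total_num_chunks):
    """Byte offset of the first complete record belonging to chunk chunk_num."""
    file_size = os.path.getsize(path)
    offset = (file_size // total_num_chunks) * chunk_num
    if offset == 0:
        return 0
    with open(path, "rb") as f:
        # Back up one byte so a record that starts exactly at offset is kept.
        f.seek(offset - 1)
        f.readline()  # advance to just past the next newline
        return f.tell()


def chunk_ranges(path, total_num_chunks):
    """(start, end) byte ranges, one per chunk, covering each record once."""
    starts = [chunk_start(path, i, total_num_chunks)
              for i in range(total_num_chunks)]
    ends = starts[1:] + [os.path.getsize(path)]
    return list(zip(starts, ends))
```

Each worker can compute its own start independently, and the end of one chunk is simply the start of the next, so no record is split or read twice. A chunk may come out empty when a single record spans more than one nominal chunk width.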

Once the data has been divided into chunks, the actual schema inference and ingestion are performed. As part of this process, two services are launched: the ingestion service and the storage service. Each of these services can employ multiple servers to do the work, and the two can also be co-located on any given machine. The ingestion service is transient and is used only during the ingestion process, whereas the storage service holds the actual data and must be persistent. The servers involved in ingestion send data to the storage service servers. The number of ingestion servers is independent of the number of storage servers and is chosen to minimize the imbalance between the throughput of the two services. The chunks are partitioned among the ingestion servers. Each ingestion server is responsible for the following steps for each chunk assigned to it: (i) parsing and type inference, (ii) communication with the storage service, and (iii) computing the local schema and statistics.

First, each data record is parsed into an internal tree representation. A consistent internal representation may be used for all source formats (JSON, BSON, etc.). Depending on the input format, type inference may also be performed. For example, JSON has no representation for dates, so dates are commonly stored as strings. Because dates are very common, they are one example of a type that is detected during ingestion so that users can issue queries that make use of dates. For CSV input files, the columns are untyped, so basic types such as integers must be detected as well.
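A minimal sketch of this kind of type inference for string values is shown below. The function name `infer_string_type`, the returned type labels, and the set of accepted date formats are all assumptions; a real system would likely recognize more formats and apply stricter numeric rules.

```python
from datetime import datetime

# Assumed set of accepted date formats; not taken from the source.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")


def infer_string_type(text):
    """Classify a raw string as 'int', 'float', 'date', or 'string'."""
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    return "string"


# A CSV row arrives as untyped strings; each cell is classified on its own.
print([infer_string_type(cell) for cell in ["42", "3.5", "2012-12-01", "mashah08"]])
```

For JSON input only string values would need the date check, since numbers are already typed; for CSV every cell goes through the full cascade.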

Once a record has been parsed and its types inferred, a compressed representation of the parse tree is sent to the storage service, in the form of a preorder traversal of the tree. The storage service is responsible for determining the values to store in each of the indexes (BI, AI, etc.) and for generating tuple ids and join keys. Key generation is deferred to the storage service so that keys can be generated sequentially, which improves ingestion performance into the underlying index store.

As records are ingested, a local JSON schema is updated using the rules described above. The schema reflects the records seen by a single ingestion machine, and different machines may have different schemas.

In addition to computing the schema, statistics are maintained, which are useful for query processing and for identifying maps. These include metrics such as the number of times each attribute appears and its average size in bytes. For example, the following records
{ id: 3546732984 }
{ id: "3487234234" }
{ id: 73242342343 }
{ id: 458527434332 }
{ id: "2342342343" }
would produce the schema { id: int, id: string }, and id: int could be annotated with a count of 3 and id: string with a count of 2. Each ingestion machine stores the schema and statistics it computed in the metadata service.
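The per-attribute counts in this example can be reproduced with a short sketch. The names `type_name` and `local_stats` are hypothetical, only top-level attributes are counted, and the average byte size mentioned in the text is left out to keep the illustration small.

```python
from collections import Counter


def type_name(value):
    """Map a parsed value to a schema type label (labels are assumed)."""
    if isinstance(value, bool):  # checked first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def local_stats(records):
    """Count how often each (attribute, type) pair appears in the records."""
    counts = Counter()
    for record in records:
        for attr, value in record.items():
            counts[(attr, type_name(value))] += 1
    return counts


records = [
    {"id": 3546732984},
    {"id": "3487234234"},
    {"id": 73242342343},
    {"id": 458527434332},
    {"id": "2342342343"},
]
stats = local_stats(records)
```

The set of keys in `stats` plays the role of the local schema, and the counts are the statistics that would be stored alongside it in the metadata service.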

Once all of the chunks have been ingested, the overall schema is computed; it is used by the query engine and presented to the user. This can be accomplished with a single process that reads the partial schemas from the metadata service, merges them using the method described above, and stores the result back in the metadata service. Because the number of schemas is limited to the number of ingestion machines, this process is not performance-critical.

Determining maps is optional. As described previously, heuristics can be used together with the statistics stored in the metadata service to decide which attributes should be stored as maps in the MI. Recall that this is not necessary for query processing, but it makes some queries more natural to express and improves efficiency. Once maps have been identified, each storage server receives a message identifying which attributes should be maps. The storage server then scans these columns and inserts them into the MI.

Incremental Updates
Some users load most of their data up front, but most load new data periodically over time, often as part of a regular (e.g., hourly or daily) process. Ingesting this data is largely similar to the initial ingestion: the new data is split into chunks, the schema is computed per chunk, and the local schemas are merged with the global schema maintained in the metadata service.

The system automatically detects new data as it is added. The method depends on the source data platform. For S3 files, for example, the simplest case is to detect changes in an S3 bucket. A special process periodically scans the bucket for new key-value pairs (i.e., new files) and adds any that it finds to the metadata service. After a certain number of new files have been found, or after a certain period of time has elapsed, the process launches a new ingestion process to load the data.
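The scan-and-trigger logic can be sketched independently of any storage API. In the sketch below, `list_keys` is a hypothetical callable standing in for a bucket listing, the thresholds are arbitrary, and the elapsed time is measured from the previous ingestion run, which is one possible interpretation of "a certain period of time".

```python
import time


class NewFileWatcher:
    """Tracks unseen keys and decides when an ingestion run is due."""

    def __init__(self, list_keys, max_pending=10, max_wait_seconds=3600.0,
                 clock=time.monotonic):
        self.list_keys = list_keys  # hypothetical: returns the bucket's keys
        self.max_pending = max_pending
        self.max_wait_seconds = max_wait_seconds
        self.clock = clock
        self.seen = set()
        self.pending = []
        self.last_run = clock()

    def scan(self):
        """One periodic scan; returns the files to ingest, or [] if not due."""
        for key in sorted(self.list_keys()):
            if key not in self.seen:
                self.seen.add(key)
                self.pending.append(key)
        now = self.clock()
        enough_files = len(self.pending) >= self.max_pending
        waited_long = bool(self.pending) and (
            now - self.last_run >= self.max_wait_seconds)
        if not (enough_files or waited_long):
            return []
        batch, self.pending = self.pending, []
        self.last_run = now
        return batch
```

In a deployment the returned batch would be recorded in the metadata service and handed to a newly launched ingestion process; here it is simply returned to the caller.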

Operations performed in MongoDB can be stored in a special collection called the operation log (or oplog). The oplog provides a consistent record of write operations, which MongoDB uses internally for replication. The oplog is read and used to create a set of flat files in S3 that store the new records. The method above can then be used to ingest the new data.

The incremental ingestion process can handle both new data (e.g., new JSON documents) and updates to existing documents (e.g., new attributes in existing JSON documents, or new values for existing attributes). Data source platforms differ in their ability to expose updates in the source files. This type of information is referred to as a "delta," and it may take the form of flat files or log files (e.g., in MongoDB). The incremental ingestion process processes the information from the "delta" files and combines it with the existing schema information to generate new data, which is sent to the storage service.

Subsetting Data
Although the system described here for ingesting data and performing incremental updates can ingest all of the data from a source, it is relatively simple to ingest only a subset by specifying up front the JSON schema (or the relational schema) of the data to be ingested. This is done either by providing the JSON schema itself or by providing a query that specifies the subset. In this way, the analysis platform can be thought of as a materialized view of the source data.

It is also possible to specify data that the user does not want ingested. A JSON schema or a relational schema can be provided that describes the portion of the data that should not be ingested. It is then simply a matter of recording in the metadata service the information that tells the ingestion process to skip those elements in all subsequent rows. If this is done after data has been ingested, the already ingested data simply becomes unavailable and can be garbage-collected by a background task. This garbage collection would be incorporated into the compaction process of the index store (e.g., LevelDB).
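Skipping excluded elements during ingestion might look like the sketch below. Representing the exclusion as a set of dotted paths is an assumption made for brevity (the text describes a JSON or relational schema), and `prune` is a hypothetical name.

```python
def prune(obj, excluded, path="root"):
    """Return a copy of obj without any attribute whose path is in excluded."""
    if not isinstance(obj, dict):
        return obj
    kept = {}
    for key, value in obj.items():
        child_path = f"{path}.{key}"
        if child_path in excluded:
            continue  # skip this element, including everything beneath it
        kept[key] = prune(value, excluded, child_path)
    return kept


tweet = {"text": "Tweet this",
         "user": {"id": 29471497, "screen_name": "mashah08"}}
print(prune(tweet, {"root.user.id"}))
```

Applying the filter before index entries are generated means excluded attributes never reach the BI, AI, or other indexes for subsequently ingested rows.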

Fault Tolerance
While the loading process can be restarted during the initial ingestion, the incremental ingestion process must not corrupt the existing data in the system, so that users do not have to reload all data from scratch. Because ingesting a file is not an idempotent operation, owing to id generation, a simple fault-tolerance scheme can be implemented based on taking snapshots of the underlying storage system.

Taking snapshots is straightforward when the underlying storage system supports consistent point-in-time snapshots, as LevelDB does. With this primitive, the steps of an incremental load are as follows. A single process directs each storage server to take a snapshot locally and directs all queries to this snapshot for the duration of the load. Each chunk is loaded as described above. When complete, the ingestion server responsible for loading a chunk marks it as finished in the metadata service.

A process monitors the metadata service. When all chunks have been loaded, it atomically redirects queries to the updated version of the state. The snapshot taken in the first step can then be discarded. In the event of a failure, the snapshot becomes the canonical version of the state, and the partially updated (and potentially corrupted) original version of the state is discarded. The ingestion process is then restarted. In addition, snapshots of the storage system disk volume can be used for recovery in the event of a server failure.
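The load steps described above may be sketched as follows, for illustration only. The sketch uses an in-memory dictionary in place of a real storage server, and the names `StorageServer` and `incremental_load` are hypothetical. Committing server by server, as done here, is not atomic across servers; a real deployment would need a single switch-over point, for example in the metadata service.

```python
import copy


class StorageServer:
    """Toy in-memory stand-in for a storage server with snapshot support."""

    def __init__(self):
        self.state = {}             # live state, mutated by the load
        self.snapshot = None        # consistent point-in-time copy
        self.serving = self.state   # what queries currently read

    def take_snapshot(self):
        self.snapshot = copy.deepcopy(self.state)
        self.serving = self.snapshot    # queries read the snapshot during the load

    def commit(self):
        self.serving = self.state       # redirect queries to the updated state
        self.snapshot = None            # the snapshot can now be discarded

    def rollback(self):
        self.state = self.snapshot      # the snapshot becomes the canonical state
        self.serving = self.state
        self.snapshot = None


def incremental_load(servers, chunks, load_chunk):
    """Load all chunks; expose the new state only if every chunk finishes."""
    for server in servers:
        server.take_snapshot()
    finished = set()    # stands in for the "finished" marks in the metadata service
    try:
        for i, chunk in enumerate(chunks):
            load_chunk(servers[i % len(servers)], chunk)
            finished.add(i)
    except Exception:
        for server in servers:
            server.rollback()   # discard the partially updated state
        return False
    if len(finished) == len(chunks):
        for server in servers:
            server.commit()
    return True
```

After a `False` result, the caller can simply restart the ingestion, because every server is back at its pre-load state.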

Query Execution
Example Query
To illustrate execution, consider the simple query
select count(*) from table as t where t.a > 10;
First, the proxy receives the query and issues it to an executor node for planning. Next, the executor node creates a query plan, calling the metadata service to determine which collections and storage nodes are available for use. The executor node typically distributes the plan to other executor nodes, but here only a single executor node is needed.

The executor node then makes RPC calls to the storage service nodes, pushing down the t.a > 10 predicate and the count function. Next, the storage nodes compute sub-counts and return them to the executor node. The executor node then returns the result to the proxy when the proxy fetches the next result value.
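For illustration only, the pushdown may be sketched as follows. The RPC layer is replaced by ordinary function calls, each inner list stands for the rows held by one storage node, and the function names are hypothetical.

```python
def storage_node_count(rows, predicate):
    """Runs on a storage node: evaluate the pushed-down predicate and count."""
    return sum(1 for row in rows if predicate(row))


def executor_count(partitions, predicate):
    """Runs on the executor node: sum the sub-counts returned by storage nodes."""
    return sum(storage_node_count(rows, predicate) for rows in partitions)


# One inner list per storage node.
partitions = [
    [{"a": 5}, {"a": 11}, {"a": 42}],
    [{"a": 10}, {"a": 12}],
]
total = executor_count(partitions, lambda row: row.get("a", 0) > 10)
```

Only one integer per storage node crosses the network, rather than every matching row.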

Dynamic Typing
The storage engines of database systems (e.g., PostgreSQL) are strongly typed, which means that all values of a column (or attribute) must have exactly the same type (e.g., integer, string, timestamp). In the context of big data analytics this is a significant limitation, because applications quite often need to change the representation of a particular piece of information (an attribute) and, consequently, the data type they use to store it. For example, an application may initially store the values of an attribute as integers and later switch to floats. Database systems are not designed to support such operations.

One way to address this problem is to use multiple relational columns for each attribute, one for each distinct data type. For example, if the attribute "user.id" has been seen with the values 31432 and "31433" (i.e., an integer and a string), "user.id<int>" and "user.id<string>" can be stored as separate columns. A single record has a value in only the one column that corresponds to the type of "user.id" in that record; the values of the other columns for that record are NULL.
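For illustration only, the per-type column layout may be sketched as follows. The helper name `split_by_type` is hypothetical, `None` stands for NULL, and only integer, string, and float values are handled.

```python
def split_by_type(records, attribute):
    """Map each value to a per-type column; the other columns stay None (NULL)."""
    type_names = {int: "int", str: "string", float: "float"}
    columns = sorted(
        {f"{attribute}<{type_names[type(r[attribute])]}>" for r in records}
    )
    rows = []
    for r in records:
        value = r[attribute]
        target = f"{attribute}<{type_names[type(value)]}>"
        rows.append({col: (value if col == target else None) for col in columns})
    return rows


rows = split_by_type([{"user.id": 31432}, {"user.id": "31433"}], "user.id")
```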

Presenting multiple columns for the same attribute is often too complicated for users. To simplify the user experience, the analysis platform can dynamically infer, at query time, the type the user intends to use. To this end, the storage service keeps track of multiple types. For example, the storage service uses a generic data type for numbers, called NUMBER, which covers both integers and floats. When the NUMBER type is used, the more specific data type is stored as part of the value. For example, the integer value 10 of the attribute "Customer.metric" is stored in the BI as a key-value pair in which (Customer.metric, <NUMBER>, tid) is the key and (10, INTEGER) is the value. The floating-point value 10.5 of the same attribute would be stored with the key (Customer.metric, <NUMBER>, tid) and the value (10.5, FLOAT).
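For illustration only, the key-value encoding may be sketched as follows. The function name `encode_number` and the use of a plain dictionary as the index are assumptions of this sketch; the disclosure does not prescribe a concrete byte layout here.

```python
def encode_number(attribute, tid, value):
    """Build a (key, value) pair; the concrete type travels with the value."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("NUMBER covers integers and floats only")
    concrete = "INTEGER" if isinstance(value, int) else "FLOAT"
    return (attribute, "NUMBER", tid), (value, concrete)


# A plain dict stands in for the index.
index = dict([
    encode_number("Customer.metric", 1, 10),
    encode_number("Customer.metric", 2, 10.5),
])
```

Because both entries share the NUMBER tag in the key, a scan over "Customer.metric" sees integers and floats together, and the stored concrete type lets the reader recover the original representation.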

Finally, at query time, the analysis platform can perform dynamic casting between data types according to the properties of the query (predicates, casting operations, etc.), as long as these operations do not result in information loss. Although "number" is not an ANSI SQL type, the flexible typing system allows clients to treat it, depending on the query context, as a standard ANSI SQL float, integer, or numeric type. For example, consider the query
select user.lang from tweets where user.id = '31432'
When both "user.id<int>" and "user.id<string>" exist, the system optionally converts integers (e.g., 31432) to a single string representation (e.g., "31432") at query time, thereby allowing the user to work with a single column "user.id" of the ANSI SQL type VARCHAR.
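For illustration only, the query-time conversion may be sketched as follows, assuming rows in which the two typed columns are held as separate fields. The helper name `user_id_as_varchar` is hypothetical.

```python
def user_id_as_varchar(row):
    """Present user.id<int> and user.id<string> as one VARCHAR column."""
    as_string = row.get("user.id<string>")
    if as_string is not None:
        return as_string
    as_int = row.get("user.id<int>")
    return None if as_int is None else str(as_int)  # lossless int-to-string cast


tweets = [
    {"user.id<int>": 31432, "user.id<string>": None, "user.lang": "en"},
    {"user.id<int>": None, "user.id<string>": "31433", "user.lang": "fr"},
]

# Equivalent of: select user.lang from tweets where user.id = '31432'
langs = [t["user.lang"] for t in tweets if user_id_as_varchar(t) == "31432"]
```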

Although ANSI (American National Standards Institute)/ISO (International Organization for Standardization) SQL:2003 is mentioned as an example, other implementations may comply with other standards, SQL or otherwise. For example only, the exposed interface may comply with ANSI/ISO SQL:2011.

Drawings
FIG. 1A shows an example cloud-based implementation of the analysis platform. A local area network (LAN) or wide area network (WAN) 100 of an organization using the analysis framework connects to the Internet 104. Compute needs and storage needs in this implementation are both provided by cloud-based services. In the particular implementation shown, compute servers are separate from storage servers. Specifically, a compute cloud 108 includes a plurality of servers 112 that provide processing power for the analysis framework. The servers 112 may be discrete hardware instances or may be virtualized servers.

The servers 112 may also have their own storage on which the processing capability operates. For example, the servers 112 may implement both the query executor and the storage service. Traditional columnar storage systems store data as columns on disk, but when that data is read into memory, rows are reassembled from the columnar data. The indexes of the present disclosure, however, operate as columnar storage both on disk and in memory. Because of the unique configuration of the indexes, the benefits of fast columnar access can be achieved with relatively little penalty.

A storage cloud 116 includes storage arrays 120 used for index data because, according to the present disclosure, data is stored in indexes and not in materialized tables. When storage resources of the servers 112 are used, the storage arrays 120 may be used for backup and nearline storage rather than for responding to each query.

In various implementations, storage arrays 124 may include data on which the analysis framework will operate. For example only, the storage arrays 124 may hold relevant data, such as log data, which users may want to query using the analysis framework.
Although the storage arrays 120 and the storage arrays 124 are shown in the same storage cloud 116, they may be located in different clouds, including private externally hosted clouds, public clouds, and organization-specific internally hosted virtualized environments.

For example only, the storage cloud 116 may be an Amazon Web Services (AWS) S3 cloud, which the business is already using to store data in the storage arrays 124. As a result, transferring data into the storage arrays 120 may be achieved with high throughput and low cost. The compute cloud 108 may be provided by AWS EC2, in which case the compute cloud 108 and the storage cloud 116 are hosted by a common provider. A user 130 constructs a query using standard SQL tools, the query is run in the compute cloud 108, and a response is returned to the user 130. The SQL tools may be tools already installed on a computer 134 of the user 130 and do not have to be modified to work with the present analysis framework.

FIG. 1B shows another example deployment approach. In this case, a physical server appliance 180 is connected to the LAN/WAN 100 of the business. The server appliance 180 may be hosted onsite or offsite and may be connected to the LAN/WAN 100, for example through a virtual private network. The server appliance 180 includes compute capability as well as storage and receives input data from sources, which may be local to the LAN/WAN 100. For example only, a computer or server 184 may store logs, such as web traffic logs or intrusion detection logs.

The server appliance 180 retrieves and stores index data for responding to queries of the user 130. The storage cloud 116 may include additional data sources 188, which may hold yet other data and/or may be a nearline data storage facility for older data. To satisfy user queries, the server appliance 180 may retrieve additional data from the additional data sources 188. The server appliance 180 may also store data in the storage cloud 116, such as for backup purposes. In various other implementations, the additional data sources 188 may be part of a Hadoop implementation in the cloud.

The analysis framework of the present disclosure is flexible, so many other deployment scenarios are possible. For example only, software may be provided to a business and installed on owned or hosted servers. In another implementation, virtual machine instances may be provided, which can be instantiated through virtualization environments. Still further, the user could be provided with a user interface in a browser, and the SQL portion could be hosted by a service provider, such as Nou Data, and implemented on its systems or in the cloud.

Data Warehouse
FIG. 1C shows an example deployment approach according to the principles of the present disclosure. Ingested data can be stored in a data warehouse 192 in addition to, or instead of, the index storage 120. In various implementations, the data warehouse 192 may be located at a customer site or, as shown in FIG. 1C, in a cloud 196.

Using the data warehouse 192 offers a variety of benefits. For example only, the data warehouse 192 will generally have a mature, full-featured query processing layer and an ODBC interface. In addition, the data warehouse 192 may be a central repository for data other than the data ingested by a system according to the present disclosure. Further, the data warehouse 192 may implement snapshotting and revision control of data and may be part of an established backup strategy.

In various implementations, the data warehouse 192 may simply be a relational database, such as one that supports a subset or the full set of SQL commands, including PostgreSQL, MySQL, Microsoft SQL Server, Oracle, etc. Because the schema of the ingested data may change over time, implementing the data warehouse 192 with columnar storage may allow additional columns (such as a new attribute or a new type of an existing attribute) to be added efficiently.

In traditional row-oriented database systems, adding columns may be inefficient in time and/or space. Various implementations of the data warehouse 192 may have columnar features, including products from Vertica, Greenplum, Aster/Teradata, and Amazon (RedShift). Some implementations of the data warehouse 192, such as Vertica and Amazon RedShift, support parallel loading, so that properly formatted data can be ingested from several sources simultaneously. Packaging data into multiple intermediate files may significantly reduce the time required to load the data into the data warehouse 192.

Implementing the data warehouse 192 in the cloud 196 may offer various advantages, such as reducing the up-front costs of purchasing hardware and software for the data warehouse 192. In addition, the cloud 196 serving the data warehouse 192 may be able to ingest data from the index storage 120 or the data source 124 with greater throughput than is available over public portions of the Internet 104. In various implementations, such as when the data warehouse 192 is Amazon RedShift and the index storage 120 is stored in Amazon S3, data may be transferred between the index storage 120 and the data warehouse 192 without leaving Amazon's network. This may reduce latency and increase throughput.

FIG. 1D shows hardware components of a server 200. A processor 204 executes instructions from a memory 208 and may operate on (read and write) data stored in the memory 208. Generally, for speed, the memory 208 is volatile memory. The processor 204 communicates, potentially via a chipset 212, with nonvolatile storage 216. In various implementations, the nonvolatile storage 216 may include flash memory acting as a cache. Larger-capacity and lower-cost storage may be used for secondary nonvolatile storage 220. For example, magnetic storage media, such as hard drives, may be used to store the underlying data in the secondary nonvolatile storage 220, with the active portions of that data cached in the nonvolatile storage 216.

Input/output functionality 224 may include inputs such as a keyboard and mouse and outputs such as a graphic display and audio output. The server 200 communicates with other computing devices using a networking card 228. In various implementations or at various times, the input/output functionality 224 may be dormant, with all interaction between the server 200 and external actors taking place via the networking card 228. For ease of illustration, additional well-known features and variations are not shown, such as, for example only, direct memory access (DMA) functionality between the nonvolatile storage 216 and the memory 208 or between the networking card 228 and the memory 208.

Data Flow
In FIG. 2A, a process diagram shows one example of how data is ingested into the analysis framework so that it can be queried by the user 130. Data sources 300 provide the data on which the analysis framework operates. If that raw data is not self-describing, optional user-defined wrapper functions 304 may convert the raw data into self-describing semi-structured data, such as JSON objects.

An administrator 308, who may be the user 130 acting in a different capacity, can specify guidelines for implementing these wrapper functions. The administrator 308 can also specify which of the data sources 300 to use and what data to retrieve from them. In various implementations, retrieving the data may include subsetting operations and/or other computations. For example only, when one of the data sources 300 is Hadoop, a MapReduce job may be requested prior to retrieving the data for the analysis framework.

The retrieved data is processed by a schema inference module 312, which dynamically constructs the schema based on the observed structure of the received data. In various implementations, the administrator 308 may be able to provide typing hints to the schema inference module 312. For example, the typing hints may include requests to recognize particular formats, such as dates, times, or other administrator-defined types, which may be specified by, for example, regular expressions.
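For illustration only, schema inference with regular-expression typing hints may be sketched as follows. The sketch handles only top-level attributes, records the set of types observed for each attribute, and uses hypothetical names (`infer_type`, `infer_schema`, `hints`); it is not the inference algorithm of the disclosure.

```python
import re


def infer_type(value, hints=()):
    """Infer a type name for one JSON value; hints are (name, regex) pairs."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        for name, pattern in hints:
            if re.fullmatch(pattern, value):
                return name     # an administrator-supplied hint matched
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_schema(docs, hints=()):
    """Collect, per attribute, every type observed across the documents."""
    schema = {}
    for doc in docs:
        for key, value in doc.items():
            schema.setdefault(key, set()).add(infer_type(value, hints))
    return schema


hints = [("date", r"\d{4}-\d{2}-\d{2}")]
docs = [
    {"id": 1, "created": "2013-03-15"},
    {"id": "x7", "created": "2014-03-14", "tag": "a"},
]
schema = infer_schema(docs, hints)
```

An attribute that ends up with more than one observed type, such as `id` here, is the case that the per-type columns described under Dynamic Typing are meant to handle.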

The data objects and the schema generated by the schema inference module 312 are provided to a decoration module 316 and an index creation module 320. Input objects include source data as well as metadata that describes the source data. The source data is stored in index storage 324 by the index creation module 320.

The decoration module 316 identifies maps in the schema generated by the schema inference module 312. In implementations where map identification is not desired, the decoration module 316 may be omitted. The administrator 308 may be able to specify map criteria that adjust the heuristics the decoration module 316 uses to identify maps.

After maps have been identified, a relational schema creation module 328 generates a relational schema, such as an SQL-compliant schema. In addition, the identified maps are provided to an auxiliary index creation module 332, which can create additional indexes, such as the MapIndex, and map entries in the ValueIndex, as described above. The auxiliary indexes may also be stored in the index storage 324.

The administrator 308 may be able to request that the map index be created and may specify which columns to add to the value index. In addition, the administrator may be able to specify which objects should be treated as maps and can dynamically change whether an object is treated as a map. Such a change results in a change to the relational schema.

A relational optimization module 336 optimizes the relational schema to present a more concise schema to the user 130. For example, the relational optimization module 336 may identify one-to-one relationships between tables and flatten those tables into a single table, as described above. The resulting relational schema is provided to a metadata service 340.

A query executor 344 interfaces with the metadata service 340 to execute queries from a proxy 348. The proxy 348 interacts with an SQL-compliant client, such as an ODBC client 352, which can interact with the proxy 348 without special configuration. The user 130 uses the ODBC client 352 to send queries to the query executor 344 and to receive responses to those queries.

Via the ODBC client 352, the user 130 can also see the relational schema stored by the metadata service 340 and construct queries over it. Neither the user 130 nor the administrator 308 is required to know the expected schema or to help create the schema. Instead, the schema is created dynamically based on the retrieved data and then presented. Although the ODBC client 352 is shown, mechanisms other than ODBC are available, including JDBC and direct postgres queries. In various implementations, a graphical user interface application may make the ODBC client 352 easier for the user to use.

The query executor 344 operates on data from a storage service 356, which includes the index storage 324. The storage service 356 may include its own local storage processing module 360, to which the query executor 344 can delegate various processing tasks. The processed data is then provided by the storage processing module 360 to the query executor 344, which constructs a response to the received query. In a cloud-based implementation, the storage service 356 and the query executor 344 may both be implemented in a compute cloud, and the index storage 324 can be stored in the compute instances. The index storage 324 may be mirrored to nearline storage, such as in the storage cloud 116 shown in FIG. 1A.

In FIG. 2B, a data loading module 370 generates data files in a format understood by the data warehouse 192. For example, the data warehouse 192 may support an SQL COPY FROM command for loading large amounts of data. The data files operated on by this command may have a predefined type, which may be a variant of comma-separated values (CSV). For each relation in the relational schema, an intermediate file is created for loading into the data warehouse 192. When the data warehouse 192 supports parallel loading, some or all of the larger files may be split into multiple files for parallel loading. Data for these intermediate files may be retrieved from the index storage 124 and/or from a second pass over the data sources 300. A user interface 374 provides the user 130 with access to the data warehouse 192. For example only, the user interface 374 may be provided as part of the data warehouse 192. In other implementations, the user interface 374 may pass commands to the data warehouse 192 and/or may create SQL commands for execution by the data warehouse 192.
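For illustration only, the generation of split intermediate files for one relation may be sketched as follows. The sketch builds the CSV text in memory with the Python standard `csv` module; the function name, the row limit per file, and the choice of an empty field for a missing value are assumptions of this sketch. How the parts are then handed to the bulk-load command of a given warehouse is not shown.

```python
import csv
import io


def write_intermediate_files(rows, columns, max_rows_per_file):
    """Render one relation as CSV text, split into parts for parallel loading."""
    parts = []
    for start in range(0, len(rows), max_rows_per_file):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        for row in rows[start:start + max_rows_per_file]:
            # A missing attribute becomes an empty field.
            writer.writerow([row.get(col, "") for col in columns])
        parts.append(buffer.getvalue())
    return parts


rows = [{"id": i, "lang": "en"} for i in range(5)]
parts = write_intermediate_files(rows, ["id", "lang"], max_rows_per_file=2)
```

Each element of `parts` can be written to its own file, so that a warehouse supporting parallel loading can load the files concurrently.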

In FIG. 2C, a user interface 376 communicates with the query executor 344 via the proxy 348. The query executor 344 may pass certain queries to the data warehouse 192. For various queries, the query executor 344 may perform part of the query based on data from the storage processing module 360 and the metadata service 340 and pass other parts of the query to the data warehouse 192. The query executor 344 may combine the results and pass the combined output to the proxy 348. In various implementations, the user interface 376 may make it transparent to the user 130 whether particular relations or data are stored in the data warehouse 192 or in the index storage 324. This feature may be relevant to customers who already have some data stored in the data warehouse 192 and are loading new data into the index storage 324, or vice versa.

FIG. 2D is a high-level functional block diagram of an example implementation of the user interface 376. A schema representation module 378 receives schema data from a schema monitoring module 380, which receives information about the relational schema from the metadata service 340. A display module 382 displays the schema to the user 130 in one or a variety of formats. For example, the hierarchy of nested attributes may be indicated by indentation level in a list format. Alternatively, a visual tree format may be used.

When the metadata service 340 informs the schema monitoring module 380 of a schema change, the schema representation module 378 may update the schema by inserting new attributes, new sub-attributes, and so on. For example, new nodes, including new intermediate nodes and new leaf nodes, may be added to the tree. As the schema changes, leaf nodes may be converted into intermediate nodes, and changes in the type of an attribute may be reflected visually with a color, a label, or a secondary indication. For example only, the secondary indication may appear when the cursor hovers over the attribute (when using, for example, a mouse or trackpad).

As changes are made to the schema, the display module 382 may attempt to keep the focus on the portion of the schema currently displayed in the center of the display area. By way of example only, if many new attributes are added to the beginning of a listed schema, the previously listed attributes may move down the screen or even off the screen. To counter this visually disruptive change, the display module 382 may scroll the list of attributes so that the central position remains approximately constant. Elements added to the schema may be indicated visually, such as by outlining, shadowing, color, or font (capitalization, italics, size).

By way of example only, a color gradient may indicate how recently a schema element changed. A very vivid blue may indicate a recently changed schema element, with the blue fading, eventually to white, to indicate a schema element that has been present for a long time.

In various implementations, the display module 382 may track mouse movement, keyboard usage, which window of the operating system has focus, and whether the display has been blanked to save power, in order to determine when the user is focused on the elements of the display module 382. For example, if the display module 382 determines that the user has not interacted with it for the past hour, the display module 382 may keep all schema elements added during that hour in vivid blue. Once the user begins interacting with the display module 382 again, the vivid blue may begin to fade. In this way, the user's attention is drawn to changes in the metadata made since the user last interacted with the display module 382, regardless of whether the user was actively monitoring the display module 382.
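The fading behavior described above can be illustrated with a minimal sketch. The names `recency_color` and `FadeClock`, the one-hour fade duration, and the linear interpolation are assumptions made for illustration only; the disclosure does not specify how the gradient is computed.

```python
def recency_color(active_seconds_since_change, fade_seconds=3600.0):
    """Map an element's age to a hex color fading from vivid blue to white."""
    # Clamp the age to [0, 1]: 0 means just changed, 1 means fully faded.
    age = min(max(active_seconds_since_change / fade_seconds, 0.0), 1.0)
    # Blue stays saturated; red and green rise together toward white.
    level = round(255 * age)
    return f"#{level:02x}{level:02x}ff"


class FadeClock:
    """Counts only the time during which the user is interacting with the display."""

    def __init__(self):
        self.active_seconds = 0.0

    def tick(self, elapsed_seconds, user_active):
        # Idle time is ignored, so elements added while the user was away
        # keep their vivid color until interaction resumes.
        if user_active:
            self.active_seconds += elapsed_seconds
        return self.active_seconds
```

In this sketch, an element would record the clock reading at the moment it changed, and its age would be the difference between the current reading and that recorded value, so that fading pauses while the user is away.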

The user interface 376 also includes a results representation module 384 that displays the results of one or more queries. The results may be displayed in a combination of textual and graphical forms, including tables, charts, and graphs. The type of visual representation may be selected by the user, who may be able to choose axis labels, linear or logarithmic scaling, and chart type. The query executor 344 may begin providing results before the query has completed.

A results monitoring module 386 is notified by the query executor 344 when further results are available. The results representation module 384 then updates the view of the results and presents it to the display module 382. In various other implementations, the results monitoring module 386 may poll the query executor 344 to determine when additional results are available. The query executor 344 may provide these incremental results on a time schedule or based on another metric, such as the number of records processed. When the results monitoring module 386 detects additional results, the results representation module 384 may need to adjust the scaling of axes to accommodate the additional data, may add further bars or slices to a bar chart or pie chart, and may adjust the values assigned to elements of the chart.

As a simplified example, consider a query that requests the average GPA for each grade level in a data set. As the query executor 344 processes the data, the initial results display the average GPA of the initial records. As additional records are parsed, the GPA may be updated. In addition, grade levels that had not yet been observed by the query are added to the results by the results representation module 384.

In various applications, and for various data sets, metrics such as averages may begin to converge while a significant number of records have yet to be parsed. This allows fast visualization of data trends and may allow the user to adjust or reformulate a query without waiting for the query to complete. This may be particularly valuable for queries that take a long time to run, such as minutes or hours. For some queries, seeing the initial results may indicate to the user that the query needs to be reformulated to return results relevant to the user.

Returning to the simplified example, the SQL query may take the form "SELECT student_id, avg(gpa) FROM students GROUP BY class ORDER BY 2 DESCENDING;".
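The incremental behavior in the GPA example can be sketched as a running aggregate that may be read at any point before the query finishes. This is only an illustration: the class name `RunningAverages`, the in-memory sums and counts, and the sample class labels are assumptions, not the query executor's actual mechanism.

```python
from collections import defaultdict


class RunningAverages:
    """Keeps per-group sums and counts so averages can be read at any time."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def add(self, group, value):
        # Called once per parsed record; previously unseen groups appear automatically.
        self._sum[group] += value
        self._count[group] += 1

    def snapshot(self):
        """Return (group, average) pairs ordered by average, highest first."""
        averages = {g: self._sum[g] / self._count[g] for g in self._count}
        return sorted(averages.items(), key=lambda item: item[1], reverse=True)


partial = RunningAverages()
partial.add("freshman", 3.0)
partial.add("freshman", 4.0)
early_view = partial.snapshot()   # only the class seen so far
partial.add("senior", 4.0)
later_view = partial.snapshot()   # a new class has joined the results
```

Each snapshot corresponds to one incremental result that a display component could redraw, rescaling axes or adding bars as new groups appear.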

A query management module 388 supplies queries entered by the user in the display module 382 to the query executor 344. The query management module 388 may store previously executed queries and allow them to be re-executed. In addition, the query management module 388 may assist the user in constructing compound queries and/or combining the results of previous queries.

In FIG. 2E, a high-level functional diagram shows the storage service 356 with multiple nodes 402-1, 402-2, and 402-3 (collectively, nodes 402). Although three nodes 402 are shown, more or fewer may be used, and the number may be varied dynamically based on the needs of the analysis framework. The number of nodes 402 may be increased as more data needs to be stored, and in response to additional processing being required to execute queries and/or to provide redundancy. The query executor 344 is shown with nodes 406-1, 406-2, and 406-3 (collectively, nodes 406). The number of nodes 406 can also be varied dynamically based on query load and is independent of the number of nodes 402.

In FIG. 2F, the storage service 356 may supply data for loading into the data warehouse 192. The metadata service 340 may supply the relational schema to the data warehouse 192, from which the data warehouse 192 can define tables. The data warehouse 192 may include multiple functional components beyond mere storage, including but not limited to a query executor 420 and an ODBC interface 424. The user interface 376 communicates with the data warehouse 192. In various implementations, the user interface 376 may also communicate with the query executor 344 of FIG. 2E.

The proxy 348 provides the interface between the ODBC client 352 and the query executor 344. The query executor 344 interacts with the metadata service 340, which stores the schemas of the data residing in the storage service 356.

Process
FIG. 3 shows an example of a data collection process. Control begins at 504, where data sources can be specified, such as by a user or an administrator. In addition, certain data sets may be selected from the data sources, and certain subsetting and reduction operations may be requested of the data sources. Control continues at 508, where the specified data sources are monitored for new data.

At 512, if new data objects have been added to the data sources, control transfers to 516. Otherwise, control returns to 504 so that the data sources can be modified if desired. At 516, the schema of a new object is inferred. The inference may be performed according to a type function such as that shown in FIG. 4. At 520, the schema inferred at 516 is merged with the already existing schema. The merge may be performed according to a merge function such as that shown in FIG. 5.

At 524, if decoration is desired, control transfers to 528; otherwise, control transfers to 532. At 528, maps are identified in the data, as shown in FIG. 8. At 536, if no new maps are identified, control continues at 532; if new maps have been identified, control transfers to 540. At 540, if a map index is desired, control transfers to 544; otherwise, control continues at 532. At 544, each value in the BigIndex or ArrayIndex that is associated with a new map attribute is added to the map index. In addition, if desired by the user and/or administrator, values are added to the value index for particular attributes. Control then continues at 532.

In various implementations, the decoration at 524 may wait until a first round of objects has been processed. For example, during initial collection, decoration may be delayed until all of the initial objects have been collected. In this way, sufficient statistics are gathered for use by the map heuristics. For incremental collection of additional objects, decoration may be performed after each new group of additional objects.

At 532, if the JSON schema has changed as a result of the new objects, control transfers to 548, where the schema is converted to a relational schema. Control continues at 552, where the relational view is optimized, such as by flattening one-to-one relationships. Control then continues at 556. If the schema has not changed at 532, control transfers directly to 556. At 556, the indexes are populated with the data of the new objects, which may be performed as shown in FIG. 7. Control then returns to 504.

Although population of the indexes is shown at 556 as being performed after the inferred schema is converted to a relational schema at 548, in various implementations the indexes may be populated before the relational schema is generated, because the relational schema is not required. The procedure can use the inferred JSON schema to generate paths and join keys. The relational schema serves as a relational view of the underlying semi-structured data.

FIG. 4 shows an example implementation of a type function that relies on recursion. Control begins at 604, where, if the object to be typed is a scalar, control transfers to 608. At 608, the type of the scalar is determined, and that scalar type is returned as the output of the function at 612. The scalar type may be determined based on self-description in the received object. In addition, further typing rules may be used, which may recognize that certain strings represent data such as dates or times.

If, at 604, the object is not a scalar, control transfers to 616. At 616, if the object is an array, control transfers to 620, where the type function (FIG. 4) is called recursively on each element of the array. Once the results of these type functions have been received, control continues at 624, where a folding function, such as that shown in FIG. 6, is called on the array of element types determined at 620. When the folded array is returned by the folding function, that folded array is returned by the type function at 628.

If, at 616, the object is not an array, control transfers to 632. At 632, the type function (FIG. 4) is called recursively on each field of the object. Control continues at 636, where the folding function is called on the concatenation of the field types determined at 632. The folded object returned by the folding function is then returned by the type function at 640.

FIG. 5 shows an example implementation of a merge function that merges two schema elements into a single schema element. The merge function is also recursive. When first called, the two schema elements are the existing schema and a new schema inferred from a newly received object. In further recursive calls of the merge function, the schema elements are sub-elements of these schemas. Control begins at 704, where, if the schema elements to be merged are equivalent, control transfers to 708 and returns either one of the equivalent schema elements. Otherwise, control transfers to 712, where, if the schema elements to be merged are both arrays, control transfers to 716; otherwise, control transfers to 720.

At 716, if one of the arrays to be merged is empty, the other array is returned at 724. Otherwise, control continues at 728, where a folding function, such as that shown in FIG. 6, is called on an array containing the elements of both arrays to be merged. The folded array returned by the folding function is then returned by the merge function at 732.

At 720, if one of the schema elements to be merged is empty, the other schema element is returned by the merge function at 736. If neither schema element is empty, control continues at 740, where the folding function is called on an object containing the key-value pairs of both schema elements to be merged. The folded object returned by the folding function is then returned by the merge function at 744.

FIG. 6 shows an example implementation of the folding function. Control begins at 804, where, if the object to be folded is an array, control transfers to 808; otherwise, control transfers to 812. At 808, if the array contains a pair of values that are both arrays, control transfers to 816; otherwise, control continues at 820. At 820, if the array contains a pair of values that are both objects, control transfers to 816; otherwise, control continues at 824. At 824, if the array contains a pair of values that are of equal scalar types, control transfers to 816; otherwise, the folding is complete and the array is returned from the folding function. At 816, a merge function, such as that shown in FIG. 5, is called on the pair of values identified at 808, 820, or 824. Control continues at 828, where the pair of values is replaced with the single value returned by the merge function.

At 812, if any of the keys in the object are the same, control transfers to 832; otherwise, the folding is complete and the object is returned. At 832, control selects a pair of keys that are the same and continues at 836. If the values for the pair of keys are both arrays or both objects, control transfers to 840; otherwise, control transfers to 844. At 840, the merge function is called on the values for the pair of keys. Control continues at 848, where the pair of keys is replaced with a single key having the value returned by the merge function. Control then continues at 852, where, if any further keys are the same, control transfers to 832; otherwise, the folding is complete and the modified object is returned. At 844, if the values for the pair of keys are both scalars, control transfers to 856; otherwise, control transfers to 852. At 856, if the scalar types of the values for the pair of keys are equal, control transfers to 840 to merge that pair of keys; otherwise, control transfers to 852.
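The mutual recursion among the type function (FIG. 4), the merge function (FIG. 5), and the folding function (FIG. 6) may be easier to follow in code. The following is a minimal sketch, not the disclosed implementation. It assumes a hypothetical schema representation in which scalar types are strings, arrays are `("array", [types])`, and objects are `("object", [(key, type), ...])`, and it treats `None` or a container with no members as an "empty" schema element. Date and time recognition is omitted.

```python
def kind(schema):
    """'array' or 'object' for tagged tuples, otherwise 'scalar'."""
    return schema[0] if isinstance(schema, tuple) else "scalar"


def is_empty(schema):
    return schema is None or (isinstance(schema, tuple) and not schema[1])


def mergeable(a, b):
    # FIG. 6 merges pairs that are both arrays, both objects, or equal scalar types.
    return kind(a) == kind(b) and (kind(a) != "scalar" or a == b)


def type_of(value):
    """Type function (FIG. 4): infer a schema from one JSON value."""
    if isinstance(value, dict):
        return fold(("object", [(k, type_of(v)) for k, v in value.items()]))
    if isinstance(value, list):
        return fold(("array", [type_of(v) for v in value]))
    if isinstance(value, bool):  # checked before numbers: bool is an int in Python
        return "boolean"
    if value is None:
        return "null"
    return "number" if isinstance(value, (int, float)) else "string"


def merge(a, b):
    """Merge function (FIG. 5): combine two schema elements into one."""
    if a == b:
        return a
    if is_empty(a):
        return b
    if is_empty(b):
        return a
    if kind(a) == kind(b) and kind(a) != "scalar":
        # Concatenate the members of both elements and let fold() reconcile them.
        return fold((a[0], a[1] + b[1]))
    raise ValueError("this sketch only merges two arrays or two objects")


def fold(schema):
    """Folding function (FIG. 6): merge compatible members until none remain."""
    tag, items = schema[0], list(schema[1])
    # Arrays hold bare types; objects hold (key, type) pairs.
    key = (lambda item: None) if tag == "array" else (lambda item: item[0])
    typ = (lambda item: item) if tag == "array" else (lambda item: item[1])
    i = 0
    while i < len(items):
        for j in range(i + 1, len(items)):
            if key(items[i]) == key(items[j]) and mergeable(typ(items[i]), typ(items[j])):
                merged = merge(typ(items[i]), typ(items[j]))
                items[i] = merged if tag == "array" else (key(items[i]), merged)
                del items[j]
                break  # re-scan from the same position with the merged member
        else:
            i += 1
    return (tag, items)
```

For instance, merging the schemas of `{"id": 1}` and `{"id": "x", "name": "n"}` in this sketch keeps both `id` entries, one typed as a number and one as a string, because same-named keys with different scalar types are not folded together.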

FIG. 7 shows an example of a process for populating the indexes with data from newly retrieved objects. Control begins at 904, where, if a RowIndex is desired, control transfers to 908; otherwise, control transfers to 912. At 908, the object is added to the RowIndex as described above, and control continues at 912. At 912, the object is flattened into relational tuples for the current relational schema, and join keys are created as needed. Control continues at 916, where control determines whether there are more tuples to add to the indexes. If so, control transfers to 920; otherwise, the indexes have been populated and control ends.

At 920, control determines whether the tuple is for an array table. If so, control transfers to 924; otherwise, control transfers to 928. At 924, if there are more value columns in the array table, control transfers to 932. At 932, if the column value exists in the original retrieved object, the value is added to the ArrayIndex at 936. Control then continues at 940. At 940, if a ValueIndex is desired for the column, control transfers to 944; otherwise, control returns to 924. If the column value does not exist in the original retrieved object at 932, control returns to 924.

At 928, if the tuple is for a map table, control transfers to 948; otherwise, control transfers to 952. At 948, control determines whether more value columns remain in the map table. If so, control transfers to 956; otherwise, control returns to 916. At 956, control determines whether the column value exists in the original retrieved object. If so, control transfers to 960; otherwise, control returns to 948. At 960, the value is added to the MapIndex, and control transfers to 964. At 964, if a ValueIndex is desired for the column, the value is added to the ValueIndex at 968. In either case, control then returns to 948.

At 952, control determines whether there are more columns present in the table. If so, control transfers to 972; otherwise, control returns to 916. At 972, control determines whether the column value exists in the original retrieved object. If so, control transfers to 976; otherwise, control returns to 952. At 976, the value is added to the BigIndex, and control continues at 980. At 980, if a ValueIndex is desired for the column, control transfers to 984, where the value is added to the ValueIndex. In either case, control then returns to 952.

FIG. 8 shows an example of a process for identifying maps. Control begins at 1004, where a first object is selected. Control continues at 1008, where, if the object is empty, the containing object is designated as a map at 1012; otherwise, control transfers to 1016. At 1016, control determines the ratio of the average field frequency to the frequency of the containing object, as described above. Control continues at 1020, where, if the ratio is below a threshold, control transfers to 1012 to designate the containing object as a map; otherwise, control transfers to 1024. By way of example only, the threshold may be user-adjustable and/or may be dynamic based on observed data. In various implementations, the heuristic may be adjusted to identify fields as maps more readily as the relational schema grows larger. At 1012, the containing object is designated as a map, and control continues at 1024. At 1024, if there are more objects to evaluate, control transfers to 1028, where the next object is selected, and control continues at 1008; otherwise, control ends.
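The ratio test at 1016 and 1020 can be sketched as a small predicate. The function name `is_map`, the use of per-field occurrence counts as "field frequency", and the default threshold of 0.01 are all assumptions made for illustration; the disclosure leaves the threshold adjustable and does not fix a value here.

```python
from typing import Mapping


def is_map(field_counts: Mapping[str, int], container_count: int,
           threshold: float = 0.01) -> bool:
    """Decide whether a containing object looks like a map rather than a record.

    field_counts    -- how many times each field was observed (hypothetical statistic)
    container_count -- how many times the containing object was observed
    threshold       -- assumed cut-off; below it the object is designated a map
    """
    if not field_counts:
        return True  # an empty object is designated as a map (1008 -> 1012)
    if container_count <= 0:
        raise ValueError("container_count must be positive")
    average_field_frequency = sum(field_counts.values()) / len(field_counts)
    return average_field_frequency / container_count < threshold
```

Under these assumptions, an object whose fields `name` and `age` appear in nearly every record yields a ratio close to 1 and stays a record, whereas an object with a thousand distinct keys that each appear once yields a ratio of 0.001 and is designated a map.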

FIG. 9 shows an example implementation of a create_schema function that relies on recursion to create a relational schema. When the create_schema function is called, control incorporates a schema element (Schema_Element) into a table (Current_Table). To this end, control begins at 1104, where, if Schema_Element is an object, control transfers to 1108; otherwise, control transfers to 1112. At 1108, if the object is an empty object, the object is treated as a map and control transfers to 1116; otherwise, control continues at 1120. At 1120, a new table (New_Table) is created for the nested object. At 1124, a join key (Join_Key) is added to Current_Table, and at 1128, a corresponding Join_Key is added to New_Table. Control then continues at 1132, where, for each field of the nested object, the create_schema function is called recursively to add the field to the table. Control then returns from the current invocation of the create_schema function at 1136.

At 1112, if Schema_Element is a map, control transfers to 1116; otherwise, control transfers to 1138. At 1116, a new table (New_Table) is created for the map. Control continues at 1140, where a Join_Key is added to Current_Table, and at 1144, a corresponding Join_Key is added to New_Table. At 1148, a key field having a string type is added to New_Table. Control continues at 1152, where, for each value type in the map, the create_schema function is called recursively to add the value type to New_Table. Control then returns at 1136.

At 1138, control determines whether Schema_Element is an array. If so, control transfers to 1156; otherwise, control transfers to 1160. At 1156, a new table (New_Table) is created for the array; at 1164, a Join_Key is added to Current_Table; and at 1168, a corresponding Join_Key is added to New_Table. At 1172, an index field having an integer type is added to New_Table. Control continues at 1176, where, for each item type in the array, the create_schema function is called to add the item type to New_Table. Control then returns at 1136.

At 1160, Schema_Element is, by process of elimination, a primitive. If Current_Table already contains a field having the same name as the primitive, control transfers to 1180; otherwise, control transfers to 1184. At 1184, the named field is simply added to Current_Table, and control returns at 1136. At 1180, type polymorphism is present, so existing fields in Current_Table having the same name as the primitive are renamed to append their type to the field name. Control continues at 1188, where a new field is added based on the current primitive, with its type appended to the field name. Control then returns at 1136.
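The recursion of FIG. 9 can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: schema elements use a hypothetical tagged form (`("object", [(name, type), ...])`, `("map", [types])`, `("array", [types])`, or a string for a primitive), child tables are named `parent.field`, join-key columns in child tables are called `id_jk`, map and array values are stored in a column called `val`, and renamed fields take the form `name<type>`. None of these names comes from the passage above.

```python
def _add_column(columns, name, col_type):
    columns.append({"base": name, "label": name, "type": col_type})


def _add_primitive(columns, name, col_type):
    same_name = [c for c in columns if c["base"] == name]
    if not same_name:                       # 1184: no clash, add the field as-is
        _add_column(columns, name, col_type)
        return
    for column in same_name:                # 1180: rename existing fields with their type
        column["label"] = f"{name}<{column['type']}>"
    columns.append({"base": name, "label": f"{name}<{col_type}>", "type": col_type})  # 1188


def create_schema(element, name, current, tables):
    """Incorporate one schema element into tables[current], adding child tables as needed."""
    kind = element[0] if isinstance(element, tuple) else "primitive"
    if kind == "object" and not element[1]:
        kind, element = "map", ("map", [])  # 1108: an empty object is treated as a map
    if kind == "primitive":
        _add_primitive(tables[current], name, element)
        return
    new = f"{current}.{name}"               # hypothetical child-table naming
    tables[new] = []
    _add_column(tables[current], name, "join_key")   # 1124 / 1140 / 1164
    _add_column(tables[new], "id_jk", "join_key")    # 1128 / 1144 / 1168
    if kind == "map":
        _add_column(tables[new], "key", "string")    # 1148
    elif kind == "array":
        _add_column(tables[new], "index", "integer") # 1172
    if kind == "object":
        for field_name, field_type in element[1]:    # 1132
            create_schema(field_type, field_name, new, tables)
    else:
        for member_type in element[1]:               # 1152 / 1176
            create_schema(member_type, "val", new, tables)


def relational_schema(root_fields, root="Root"):
    """Build {table: [(column, type), ...]} from the fields of a top-level object."""
    tables = {root: []}
    for field_name, field_type in root_fields:
        create_schema(field_type, field_name, root, tables)
    return {t: [(c["label"], c["type"]) for c in cols] for t, cols in tables.items()}
```

Calling `relational_schema` on a root object whose `id` field is sometimes a number and sometimes a string produces two columns, `id<number>` and `id<string>`, which mirrors the renaming performed at 1180 and 1188.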

In FIG. 10A, control begins at 1204, where a user or administrator designates and/or modifies sources of data. At 1208, control infers a schema from the data and populates the indexes as the schema is being inferred, as described in detail above. At 1212, control determines whether decoration is desired, which may be configured by the user or the administrator. If so, control transfers to 1216; otherwise, control transfers to 1220. At 1216, control identifies maps within the schema and updates the schema to reflect the identified maps. Based on settings from the user and/or the administrator, certain identified maps can be manually reverted to separate columns by the user and/or the administrator. This may be done at the time of collection or at any time while the data is in use. Once a map has been identified and a map index created, the data may remain in the map index, so that the schema can reflect the map or reflect the columns separately, and the user or administrator can switch between these configurations without reloading the data. Control continues at 1220.
At 1220, control converts the inferred schema to a relational schema. At 1224, control packages the data into a format recognizable by the particular data warehouse in use. At 1228, tables according to the relational schema are created in the data warehouse. By way of example only, the SQL create table command may be used. At 1232, the packaged data is bulk loaded into the data warehouse. If the data warehouse can bulk load in parallel, the packaging of data at 1224 may create multiple files for each database relation to speed up the bulk load at 1232.
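Table creation at 1228 can be illustrated by generating a create table statement from one relation of the relational schema. The type mapping in `SQL_TYPES`, the function name `create_table_sql`, and the identifier quoting are assumptions chosen for the sketch; an actual data warehouse would dictate its own type names and quoting rules.

```python
# Hypothetical mapping from inferred types to SQL column types.
SQL_TYPES = {
    "string": "VARCHAR",
    "number": "DOUBLE PRECISION",
    "integer": "BIGINT",
    "boolean": "BOOLEAN",
    "join_key": "BIGINT",
}


def quote_identifier(name):
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table, columns):
    """Build one CREATE TABLE statement from (column_name, inferred_type) pairs."""
    column_defs = ", ".join(
        f"{quote_identifier(name)} {SQL_TYPES[col_type]}" for name, col_type in columns
    )
    return f"CREATE TABLE {quote_identifier(table)} ({column_defs});"
```

Quoting the identifiers lets names that contain punctuation, such as a type suffix appended to a field name, be used as column names without further rewriting.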

In FIG. 10B, a modified process may be used when the index store is unavailable, full, or not desired at this particular time. After 1204, the schema of the data is inferred at 1250. Unlike in FIG. 10A, the data is not used to populate indexes. After 1220, tables according to the relational schema are created in the data warehouse at 1228. Control continues at 1254, where a second pass is performed over the data to create intermediate files for local loading into the data warehouse. Control then continues at 1232, where the bulk load into the data warehouse is performed.

FIG. 11 shows an example process for incorporating new data into the analysis platform backed by the data warehouse. At 1304, control determines whether new data has been received from the designated data sources. If so, control transfers to 1308; otherwise, control remains at 1304. At 1308, control infers a schema of the new data and populates indexes with the new data. Control continues at 1312, where control determines whether the schema of the new data is a subset of the already existing schema. If so, control continues at 1316; otherwise, control transfers to 1320. At 1320, the new schema is merged with the existing schema, and control continues at 1324. At 1324, control determines whether decoration is desired. If so, control transfers to 1328; otherwise, control transfers to 1332. At 1328, control identifies maps based on the new data. These identified maps may encompass both new data and previous data when the new data causes an attribute to qualify as a map according to the map criteria. If additional maps are identified, the schema is updated. Control then continues at 1332. At 1332, the merged schema is converted to a relational schema. At 1336, tables are modified in the data warehouse and/or new tables are created. At 1340, the user interface is informed that the schema has been updated and that the user interface should therefore display the updated schema to the user. Control then continues at 1316. At 1316, control packages the new data from the indexes for bulk loading. Control continues at 1344, where the newly packaged data is bulk loaded into the data warehouse. Control then returns to 1304.
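The subset test at 1312 and the merge at 1320 can be sketched as follows. This is a simplified illustration that assumes each schema is represented as a flat mapping from a dotted attribute path to the set of types observed for it; that representation and the function names are assumptions, not the representation used by the platform.

```python
def is_subset(new, existing):
    """True if every attribute path and type in `new` already appears in `existing`."""
    return all(types <= existing.get(path, set()) for path, types in new.items())


def merge(existing, new):
    """Return a merged schema without modifying either input."""
    merged = {path: set(types) for path, types in existing.items()}
    for path, types in new.items():
        merged.setdefault(path, set()).update(types)
    return merged


def incorporate(existing, new):
    """Return (schema, changed); `changed` signals that tables and the UI need updating."""
    if is_subset(new, existing):
        return existing, False
    return merge(existing, new), True
```

When `changed` is false, control can skip directly to packaging the new data (1316); when it is true, the merged schema would go on to be converted to a relational schema (1332).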

FIG. 12 is an example high-level overview of user interface operation. Control begins at 1404, where an inferred relational schema is presented to the user. Control continues at 1408, where if the schema has changed, control transfers to 1412; otherwise, control transfers to 1416. At 1412, control updates the displayed schema in the user interface and continues at 1420. At 1420, control optionally identifies changes to the schema graphically. For example only, recently changed schema elements may be highlighted visually. In various implementations, the user interface may determine whether a schema element has recently changed based on when the last query was run. For example only, schema elements that have changed since the last query was run may be specially highlighted. Control then continues at 1416. At 1416, if a new query has been requested by the user, control transfers to 1424; otherwise, control transfers to 1428. At 1424, control begins displaying query results from the executed query. These results may be incomplete, such as by lacking certain rows or columns and/or by containing inaccurate or partially inaccurate data. Control continues at 1428. At 1428, if there are additional query results from a query that is in progress, control transfers to 1432; otherwise, control returns to 1408. At 1432, control updates the interface with the additional results. Control continues at 1436, where, if a plot of data is being displayed, various aspects of the chart may be modified, such as by rescaling and relabeling the axes. Control continues at 1440, where control graphically identifies changes to the query results. For example only, query results that have repeatedly changed are highlighted. In various implementations, query results that have changed by a large percentage or by a large amount may be highlighted more prominently. In addition, new columns and/or new rows may be uniquely identified. Further, ghosting and/or coloring may indicate the current value as well as previously displayed values to visually represent trends in the query results. Control then returns to 1408.

Graphical User Interface
In FIG. 13A, the graphical user interface displays an inferred schema in a left-hand pane and shows a query and query results in a right-hand pane. In these examples, an example representation of Twitter attributes is presented.

In FIG. 13B, note that additional relational schema attributes have appeared since FIG. 13A. Specifically, attributes beginning with in_reply_to were dynamically added to the user interface because additional data containing those attributes was parsed. FIG. 13C shows one representation of a nested object. Specifically, attributes are shown beneath the node user.

FIG. 13D presents a tabular representation of data. In this display, the query has found 10 languages. In FIG. 13E, 24 languages have been found. In this particular example, the counts for the original 10 languages have not changed, indicating that the records parsed between the display of FIG. 13D and the display of FIG. 13E were in additional languages not shown in the initial 10 of FIG. 13D.

In FIG. 13B, additional attributes were dynamically added to the display subsequent to the display shown in FIG. 13A. In FIG. 13E, additional query results were dynamically added to the display subsequent to the display shown in FIG. 13D.

Automated Extract, Transform, Load (ETL)
Schema inference was introduced above. Schema inference extracts a cumulative (or global) schema (also called a JSON schema in certain implementations) from a collection of semi-structured objects (JSON documents in certain implementations). This cumulative schema is incrementally updated as more input becomes available. The cumulative schema may be decorated to designate sets of entities that belong to maps or arrays. Creating such a cumulative schema, and processing data based on the cumulative schema, can be used advantageously as part of, or in place of, a traditional extract, transform, load (ETL) process. The resulting ETL process may improve one or more of speed, fidelity, and usability.

In general terms, ETL refers to a process that governs the movement of data from one or more source locations to one or more destination locations, with an optional transformation stage in which selected data sets undergo certain transformations. Transformation may also be necessary to conform to the input format of one of the destinations. The sources and destinations can be relational databases, object stores (e.g., NoSQL or key-value stores), or repositories of data that follow the format of those databases or stores (e.g., local or distributed file systems, or cloud stores, that store files or documents containing CSV or JSON data).

ETL fidelity can be defined as how accurately data items are mapped from source to destination. For example, an ETL process that is tasked with loading data items a, b, and c from source to destination and that results in items b and c residing at the destination is more faithful than an ETL process that results in only item c being loaded.

ETL usability can be measured by the (reduced) number of steps taken, by both the user and the computing system, when performing tasks at the destination computing system that use part of the loaded data as input. For example, two different ETL processes that both result in data items b and c being loaded from source to destination can have different usability if one of them results in a smaller number of steps to compute the maximum of the two items at the destination system.

As described above, semi-structured data can be transformed into relational form and loaded into a system that supports indexed columnar storage. In addition to, or in place of, the columnar storage, data can be loaded into multiple targets. This can be generalized to describe a flexible ETL process that takes data from one or more (possibly semi-structured) sources, optionally applies one or more transformations, and loads the data into one or more (possibly relational) targets. In various implementations, the ETL process may use an index store for intermediate storage or may omit the intermediate store.

FIG. 14 shows an overview of the ETL process, which is a higher-level and more generalized version of FIG. 2B. At the left of FIG. 14, data starts out in one or more data sources (e.g., data sources 1504 and 1508) and is extracted by a data collector module 1512 in the form of raw records. The data collector module 1512 may produce the raw records in a semi-structured format such as JSON. When the data sources 1504 and 1508 contain data in a predetermined format, such as text files storing JSON objects, the data collector module 1512 may pass the JSON objects through unchanged. In various implementations, one or more additional data collector modules (not shown) may be implemented to process data in parallel from the same data source or from multiple data sources. In addition, data collector modules may be implemented as a chain that progressively transforms data into a form usable by a schema inference and statistics collection module 1516 and a metadata store 1540.

The collected records are passed to the schema inference and statistics collection module 1516. Based on the cumulative schema determined by the schema inference and statistics collection module 1516, the data can be loaded into an index store 1520 for later loading into one or more destination systems (data destination systems 1524 and 1528, as shown in FIG. 14). Additionally or alternatively, data may bypass the index store 324 and be sent directly to one or more of the data destination systems 1524 or 1528 by way of an export module 1534.

A transformation module 1536 may be implemented to perform one or more transformations on data from the index store 1520 before the data is stored, via the export module 1534, in one or both of the data destination systems 1524 or 1528. Alternatively, the transformation module 1536 may bypass the index store and transform data received directly from the schema inference and statistics collection module 1516.

The export module 1534 supplies data to the data destination systems 1524 and 1528 in a format compatible with their ingest commands. For example only, the export module 1534 may supply data in tabular row form for a SQL-based data warehouse or database. In response to the data destination system 1524 accepting JSON objects, the export module 1534 may supply objects, such as JSON objects, to the data destination system 1524. In various implementations, the objects may be passed through unchanged from the data collector module 1512. In response to the data destination system 1524 accepting columnar data, the export module 1534 may pass column-based data through unchanged from the index store 1520.

The metadata store 1540 records the state of each of the data collector module 1512, the schema inference and statistics collection module 1516, and the index store 1520. A scheduling module 1544 assigns jobs to the data collector module 1512, the schema inference and statistics collection module 1516, the index store 1520, and the export module 1534. The scheduling module 1544 may schedule the jobs based on dependencies described in more detail below with respect to FIGS. 16A and 16B. A monitoring system 1548, as described below, records performance and error data regarding operation of one or more of the data collector module 1512, the schema inference and statistics collection module 1516, the index store 1520, the transformation module 1536, the metadata store 1540, and the data destination systems (in the example of FIG. 14, the data destination system 1524).

Extracting Data from Multiple Sources
Data Sources
Object Sources
The definition of input sources for the ETL process is broadened to cover object sources. These sources can include NoSQL stores, document stores such as MongoDB and Couchbase, data structure stores such as Redis, and key/multi-value stores such as Cassandra, HBase, and DynamoDB. Objects stored in files on a file store can be treated as additional object sources. Objects stored in files can include JSON, BSON, Protobuf, Avro, Thrift, and XML.

In various implementations, the nature of some or all of the input sources selected for data extraction may be detected automatically. Specialized collectors may be used for different types of input sources. For simplicity, the remainder of the description uses JSON documents as an example of object sources. The present disclosure could instead use another type of object source, and there may be a one-to-one mapping between the other object source and JSON. In other words, the present disclosure applies directly to those other types of object sources.

Relational Sources
Relational sources, such as databases, may also be used as sources. The process of extracting data from a relational source can be thought of, in a sense, as running in reverse the process of generating a relational schema from a JSON schema. The process begins with identifying a root table, each row of which is converted into an object. The root table may be specified by the user, selected automatically by the data collector process, or selected by the data collector process based on previous heuristics or on rule sets provided by the user or administrator. The data collector may select the root table subject to later manual change by the user.

The rows of that root table are joined with rows in other tables to create complete objects. For example, rows in other tables may be selected using foreign key relationships, statistical pairings, or user directions. Statistical pairings can be implemented by using statistics to detect similarity of value distributions between two columns. Timestamp and key columns are particularly good candidates for columns that may predict relationships. If there is a one-to-one relationship between rows of a parent table and rows of a child table, the row in the child table simply becomes a nested object in the primary object. If there is a one-to-many relationship, the rows become part of an array of nested objects. This process may create objects that can be mapped to JSON documents, at which point the description of JSON processing in this disclosure applies directly.
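One simple way to approximate a statistical pairing is sketched below. The disclosure does not specify which statistic is used; the Jaccard overlap of distinct values, the `threshold` parameter, and the function names are all assumptions chosen for illustration, and a fuller implementation might compare histograms or other distribution summaries instead.

```python
def value_overlap(column_a, column_b):
    """Jaccard overlap of the distinct values in two columns (0.0 to 1.0)."""
    a, b = set(column_a), set(column_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_join_columns(parent, child, threshold=0.5):
    """Rank (parent column, child column) pairs whose values look related.

    `parent` and `child` map column names to lists of values.
    """
    pairs = []
    for p_name, p_values in parent.items():
        for c_name, c_values in child.items():
            score = value_overlap(p_values, c_values)
            if score >= threshold:
                pairs.append((p_name, c_name, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

A pair with a high score would then be treated as a candidate join key between the root table and the other table, subject to confirmation by foreign key metadata or by the user.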

As an example, consider the following tables.
Table: user
userid, name, age
1, "Nate", 27
2, "Stavros", 87
Table: address
userid, start, end, city
1, 1995, 2006, Ann Arbor
1, 2006, NULL, Redwood City
2, 2005, 2007, Cambridge
2, 2007, NULL, San Francisco

The JSON schema may be inferred from those tables as follows.
user : { "userid" : "integer",
         "name" : "string",
         "age" : "integer",
         "address" : [ {
             "start" : "integer",
             "end" : "integer",
             "city" : "string" } ]
}

Using the JSON schema to create objects from the input tables yields the following objects.
{ "userid" : 1, "name" : "Nate", "age" : 27,
  "address" : [ { "start" : 1995, "end" : 2006,
                  "city" : "Ann Arbor" },
                { "start" : 2006, "end" : null,
                  "city" : "Redwood City" } ] }
{ "userid" : 2, "name" : "Stavros", "age" : 87,
  "address" : [ { "start" : 2005, "end" : 2007,
                  "city" : "Cambridge" },
                { "start" : 2007, "end" : null,
                  "city" : "San Francisco" } ] }
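The construction of these objects can be sketched in a few lines. The sketch assumes the join column (userid) is already known, for instance from a foreign key, and treats the one-to-many relationship by nesting child rows as an array; the helper name `nest` and its parameters are hypothetical.

```python
import json

users = [
    {"userid": 1, "name": "Nate", "age": 27},
    {"userid": 2, "name": "Stavros", "age": 87},
]
addresses = [
    {"userid": 1, "start": 1995, "end": 2006, "city": "Ann Arbor"},
    {"userid": 1, "start": 2006, "end": None, "city": "Redwood City"},
    {"userid": 2, "start": 2005, "end": 2007, "city": "Cambridge"},
    {"userid": 2, "start": 2007, "end": None, "city": "San Francisco"},
]


def nest(root_rows, child_rows, key, child_name):
    """Join child rows to root rows on `key` and nest them as an array."""
    children = {}
    for row in child_rows:
        # Drop the join key from the nested object; it lives on the root.
        nested = {k: v for k, v in row.items() if k != key}
        children.setdefault(row[key], []).append(nested)
    return [
        dict(root, **{child_name: children.get(root[key], [])})
        for root in root_rows
    ]


objects = nest(users, addresses, key="userid", child_name="address")
for obj in objects:
    print(json.dumps(obj))
```

Root rows with no matching child rows receive an empty array here; whether to emit an empty array or omit the attribute is a design choice this disclosure does not prescribe.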

Supporting Relational Sources Through Eventization
In some cases, a relational source may contain multiple tables that are not associated with a single root table, or the manner in which the different tables should be joined may not be obvious. In situations where each table (or set of tables) includes at least one column with timestamp data, the collector process can "eventize" a set of tables (referred to as "eventization") as follows. A new logical or physical event table is created having as columns the union of all columns from the input tables (the same column name appearing in more than one table may lead to a single column with that name in the new table) and at least two additional columns, "event" and "time". These names can be modified programmatically at the time of eventization (such as with predefined prefixes and/or suffixes) to avoid conflicts with the column names of the existing tables.

The new table is populated with the rows from each table in the set of input tables. For columns in the new table that do not exist in an imported row, null values can be used. The "event" column takes as its value the name of the input table from which a row is imported. The "time" column specifies the sort order of the eventized table. The "time" column is populated with the timestamp information from the corresponding row of the input table.

When there are multiple columns with timestamp data in a source table, multiple "time" columns can be added, such as "time1", "time2", and so on. Each "time" column is populated by the corresponding timestamp column of the input table. Alternatively, one of the columns in the input table with timestamp data may be selected as the governing column and used to eventize the table. The selection of the timestamp column in the input table may be made through an automatic rule (e.g., always pick the smallest value, or always pick the left-most column), based on user input, or through statistical rules that can help derive the time at which the event represented by the row most likely took place.

As an example of eventization, consider the following schemas of three example source tables.
table:user_log {"id": "number", "session_start" :
"timestamp", "session_end" : "timestamp"}
table:query_log {"id": "number", "query_text" : "string",
"query_start" : "timestamp", "duration" : "number"}
table:cpu_load {"time":"timestamp", "load" : "float"}

An example schema for the eventized table is shown here.
table:events { "event" : "string",
"_time" : "timestamp",
"_time2" : "timestamp",
"id": "number",
"session_start" : "timestamp",
"session_end" : "timestamp",
"query_text" : "string",
"query_start" : "timestamp",
"duration" : "number",
"time" : "timestamp",
"load" : "float" }

Consider one example row from each source table.
user_log : (15213, "01/01/2014 12:15:00",
"01/01/2014 12:15:30")
query_log : (79004525, "select * from T;",
"01/01/2014 10:10:00", 53)
cpu_load : ("01/01/2014 11:20:30", 74.5)

The following is an example of how the eventized table is populated.
("user_log", "01/01/2014 12:15:00", "01/01/2014 12:15:30",
15213, "01/01/2014 12:15:00", "01/01/2014 12:15:30",
NULL, NULL, NULL, NULL, NULL),
("query_log", "01/01/2014 10:10:00", NULL, 79004525, NULL,
NULL, "select * from T;", "01/01/2014 10:10:00", 53,
NULL, NULL),
("cpu_load", "01/01/2014 11:20:30", NULL, NULL, NULL, NULL,
NULL, NULL, NULL, "01/01/2014 11:20:30", 74.5)
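The eventization steps above can be sketched as follows. This is an illustrative sketch rather than the disclosed implementation: the function name `eventize`, the `timestamp_columns` argument (which names the timestamp columns to copy for each table), and the month/day/year reading of the example timestamps are assumptions. The special columns are prefixed with an underscore, as in the example schema, to avoid colliding with the source column named "time".

```python
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed layout of the example timestamps


def eventize(tables, timestamp_columns):
    """Combine rows of several tables into one list of event rows.

    tables: {table name: list of row dicts}
    timestamp_columns: {table name: [timestamp column, ...]}
    """
    # Union of all input columns, in first-seen order.
    columns = []
    for rows in tables.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    max_ts = max((len(c) for c in timestamp_columns.values()), default=1)
    time_names = ["_time"] + [f"_time{i}" for i in range(2, max_ts + 1)]

    events = []
    for table_name, rows in tables.items():
        ts_cols = timestamp_columns[table_name]
        for row in rows:
            event = {"event": table_name}
            for i, time_name in enumerate(time_names):
                event[time_name] = row[ts_cols[i]] if i < len(ts_cols) else None
            for name in columns:
                event[name] = row.get(name)  # None stands in for NULL
            events.append(event)
    # The first time column governs the sort order of the eventized table.
    return sorted(events, key=lambda e: datetime.strptime(e["_time"], TIME_FORMAT))


tables = {
    "user_log": [{"id": 15213, "session_start": "01/01/2014 12:15:00",
                  "session_end": "01/01/2014 12:15:30"}],
    "query_log": [{"id": 79004525, "query_text": "select * from T;",
                   "query_start": "01/01/2014 10:10:00", "duration": 53}],
    "cpu_load": [{"time": "01/01/2014 11:20:30", "load": 74.5}],
}
events = eventize(tables, {"user_log": ["session_start", "session_end"],
                           "query_log": ["query_start"],
                           "cpu_load": ["time"]})
```

Unlike the listing above, which shows the rows in source-table order, this sketch returns them sorted by the governing time column, so the query_log row comes first.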

Note that the eventized table retains the timestamp values in their original locations in the schema, but also copies one or more (in this case up to two) timestamps into the special "time" columns to eventize the rows. In this way, distinct rows of the input may include mutually exclusive timestamp columns that are preserved while still creating a single reference "time" column.

The eventization of a set of input tables at the source input may exist as a physical table (the table is materialized) or as a logical table. For a logical table, the eventization process can create a stream of rows that belong to the eventized table, and this streamed input is directed to the ETL process. Eventization allows a relational source to be treated as an object source because each row in the eventized table is similar to an object in an object store that contains information about an event that took place at a certain time.

Eventization can lead to increased usability, especially in cases where the user wants to query the data at the destination to extract information across different events. For example, querying the relational source to find information about rows with a certain timestamp value (or range of values) may be more complex and time-consuming than querying the eventized table.

Data Collectors
Collectors are software components that can be used to extract individual records from one or more of the data sources described above. Processing of received files in standard formats (JSON, BSON, etc., which may or may not be compressed) was described above, but other mechanisms can be used to monitor data sources for new data and to extract this data for sending to the ingestion process.

File System Monitoring
If data is provided in one or more directories in a file system, the collector process can periodically monitor the file system for changes. The file system may be a traditional local file system (e.g., ext4) or a distributed store with a file system-like interface (e.g., HDFS, Amazon S3). Depending on the precise interface of the file system, the collector can detect new files by periodically scanning the file system and comparing the result with a list of existing files, or by using a notification mechanism built into the interface (e.g., S3 bucket logging).
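The scan-and-compare approach can be sketched for a local file system as follows. The function name and the use of an in-memory set for the list of existing files are assumptions; a collector would likely persist that list (for example, in the metadata store) and would call the function on a timer.

```python
import os


def scan_for_new_files(directory, known_files):
    """Return sorted paths under `directory` not seen before.

    `known_files` is a set of previously seen paths; it is updated in place
    so that the next scan reports only files added since this one.
    """
    new_files = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if path not in known_files:
                known_files.add(path)
                new_files.append(path)
    return sorted(new_files)
```

This sketch detects only newly added paths; detecting files that were modified in place would additionally require comparing sizes or modification times.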

Periodic Snapshots
If the source data is stored in an existing data store other than a file system, such as a relational database (e.g., MySQL, PostgreSQL, Oracle, etc.) or a NoSQL store (e.g., MongoDB, CouchDB, etc.), the collector can make use of a built-in snapshot mechanism. Most data stores support a mechanism for exporting data to a file system or to another process. For example, PostgreSQL supports a SQL COPY command, and an external utility named pg_dump is available. Using this mechanism, a dump of the source data can be performed periodically to ingest new records. A snapshot can be initiated by the collector via a remote procedure call or by the user. When complete snapshots are taken, individual records may appear across multiple snapshots. Such duplicate records may be identified and ignored, such as by using a source-specific primary key or, if complete records are guaranteed to be distinct, by applying a hash function to the entire record for comparison purposes.
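Both de-duplication strategies can be sketched together. The function names, the use of SHA-256 over a canonical JSON encoding, and the in-memory `seen` set are illustrative assumptions; the disclosure does not name a particular hash function or storage for previously seen keys.

```python
import hashlib
import json


def record_key(record, primary_key=None):
    """Identify a record by its primary key, or by a hash of the whole record."""
    if primary_key is not None:
        return ("pk", record[primary_key])
    # Canonical encoding so that attribute order does not affect the hash.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return ("hash", hashlib.sha256(canonical.encode("utf-8")).hexdigest())


def new_records(snapshot, seen, primary_key=None):
    """Return records in `snapshot` whose key is not in `seen`, updating `seen`."""
    fresh = []
    for record in snapshot:
        key = record_key(record, primary_key)
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh
```

With a primary key, a record whose other attributes changed between snapshots is still treated as a duplicate; with whole-record hashing, it would be treated as new. Which behavior is appropriate depends on the source.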
Replication log

Many data stores that support replication maintain an operation log that is used to ensure that replicas stay synchronized. If this log can be accessed externally, the collector may use it to find new data directly. For example, MongoDB exposes an oplog (operation log), which can be queried using the standard MongoDB API and contains an entry for every insert, update, and delete operation in the database. By reading this log, the monitoring process can identify new records that need to be ingested.

Data extraction

Once new source data has been detected, there are several ways to send that data to the ingestion process. If the data is already in a file, the ingestion process can simply open the file and read the data directly. If the file is located on a network, it may be opened remotely over the network using a network file system such as HDFS (Hadoop Distributed File System).

If the data is not already in a file system (e.g., it comes from a replication log), the collector can create one or more intermediate files containing the new records. These files may be stored in a conventional file system or in a distributed store such as HDFS or Amazon S3. Once the files are created, they can be loaded as described above.

The collector can also send data directly to the ingestion process using an inter-process communication (IPC) or remote procedure call (RPC) mechanism. This lets the ingestion process start processing new data without waiting for it to be written to a file, and avoids maintaining a separate copy of the data. In situations where a backup of the data is desirable (e.g., when the source can be accessed only once), the data can be sent directly to the ingestion process and asynchronously written to a backup file. This backup can be used for recovery if an error occurs during the ingestion process.

Statistics

Because the schema inference step requires processing all of the source data, statistics can be computed during this process without an additional pass over the data. Statistics can provide a number of benefits, including enhancing ETL fidelity.

Statistics on schema attributes

The first class of statistics can be computed from the individual attributes of the cumulative schema in the source data. First, the frequency with which each attribute and type appears in the data can be tracked. This may be useful for a number of reasons. For example, as described above, the heuristics for map decoration are based on the frequency with which an attribute appears relative to the frequency of the object that contains it.

Frequencies can also be used to make type-based decisions and to resolve type conflicts. Type information can be used to optimize physical storage decisions. For example, numeric types can be tracked and used to distinguish 32-bit and 64-bit integers and floating-point types. Because each of these types may require a different amount of storage space, determining which type is applicable allows the minimum necessary space to be allocated.

Type diversity arises when the same attribute appears multiple times in the source data with different types. The schema inference mechanism and index store described above fully support type diversity, but in some cases diversity in a given field may be undesirable. For example, if a single attribute appears as an integer in 99.9% of records and as a string in the other 0.1%, the string-typed records may be erroneous. They might indicate an error in data entry or validation, or corruption of the source data. These outlier records may be brought to the user's attention and/or automatically type-converted based on heuristics. For example, if fewer than 1% of records have a divergent type, those records may be converted to the dominant type; the event may be noted and, optionally, a log entry describing the converted source data may be created for the ETL process.
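The 1% heuristic above might be sketched as follows. The function names, the type labels ("int", "string", "float"), and the fallback of leaving unconvertible values unchanged are all assumptions made for the example.

```python
from collections import Counter


def dominant_type(type_counts, threshold=0.01):
    """Return the dominant type if all other types together fall below `threshold`.

    `type_counts` maps a type name to the number of records in which the
    attribute had that type. Returns None when no conversion is suggested.
    """
    total = sum(type_counts.values())
    if total == 0:
        return None
    name, count = type_counts.most_common(1)[0]
    minority_share = (total - count) / total
    return name if 0 < minority_share < threshold else None


def coerce(value, target):
    """Try to convert `value` to `target`; leave it unchanged if conversion fails."""
    converters = {"int": int, "string": str, "float": float}
    try:
        return converters[target](value)
    except (KeyError, ValueError, TypeError):
        return value
```

A caller would typically log each value that `coerce` changes, so that the ETL process retains a record of the conversion.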

Consider the following example JSON records:
{"id": 12345678, "location": "Saratoga, CA", "tags": ["urgent", "direct"]}
{"id": "12345678", "location": "San Francisco, CA", "source": "iPhone"}

The cumulative schema inferred from these records may be:
{
  "id": "string",
  "id": "number",
  "location": "string",
  "source": "string",
  "tags": ["string"]
}

The cumulative schema may be wrapped in a JSON schema that associates each attribute and type of the cumulative schema with a count. For example:
{
  "type": "object",
  "count": 2,
  "properties": {
    "id": {
      "type": [
        {"type": "string", "count": 1},
        {"type": "int32", "count": 1}
      ]
    },
    "location": {
      "type": "string",
      "count": 2
    },
    "source": {
      "type": "string",
      "count": 1
    },
    "tags": {
      "type": "array",
      "count": 1,
      "items": {
        "type": "string",
        "count": 2
      }
    }
  }
}

Note that because id appears once as a string and once as a 32-bit integer, each type is listed with a count of 1, while the root object has a count of 2. Furthermore, the tags array appears once, so its count is 1, but it contains two string items, so its items field has a count of 2. The frequency of each attribute in a record can be computed during the typing process, and counts from multiple attributes can simply be added. Because addition is associative and commutative, this process can be performed in parallel. Counts can be maintained independently for separate data streams and then merged to compute global statistics.
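The mergeable counts can be illustrated with a short sketch. The flat `(attribute, type)` keys, the `"<root>"` and `"[]"` naming, and the simple 32-bit range test are assumptions chosen for brevity; they stand in for the nested JSON schema shown above.

```python
from collections import Counter


def type_name(value):
    """Map a parsed JSON value to a coarse type label (illustrative labels only)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int32" if -2**31 <= value < 2**31 else "int64"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "null"


def count_attributes(records):
    """Count how often each (attribute, type) pair appears in top-level records."""
    counts = Counter()
    for record in records:
        counts[("<root>", "object")] += 1
        for key, value in record.items():
            counts[(key, type_name(value))] += 1
            if isinstance(value, list):
                for item in value:
                    counts[(key + "[]", type_name(item))] += 1
    return counts
```

Because `Counter` addition is associative and commutative, partial counts from separate streams can be combined with `+` in any order and yield the same global counts.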

Other statistics can be associated with each attribute in the same way, as long as they can be computed in an associative and commutative manner. For example, statistics such as the average length of string attributes, or how often a string value appears to represent another type (such as a date), can be tracked. For averages, the sum and the count are maintained separately, and the division is performed once all of the data has been aggregated (because division is neither associative nor commutative).

Statistics on values

In addition to collecting statistics about the schema of the source data, such as how often each key appears, statistics can be gathered about the values associated with a specific attribute. These statistics can be used for a variety of applications, including query optimization, data discovery, and anomaly detection. The statistics of interest depend on the type of the attribute. For numeric attributes, these metrics may include basic statistical measures such as the minimum and maximum values, as well as the mean and standard deviation of the distribution. To allow for parallelism, statistics may be chosen such that they can be collected using commutative and associative operations.
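One way to keep numeric statistics mergeable is to store only sums, counts, and extrema, and to derive the mean and standard deviation at the end. The sketch below assumes this representation; the class name `NumericStats` and its fields are hypothetical, and the sum-of-squares formula is chosen for simplicity rather than numerical robustness.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericStats:
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    @classmethod
    def of(cls, values):
        """Build local statistics for one batch of values."""
        values = list(values)
        return cls(
            count=len(values),
            total=sum(values),
            total_sq=sum(v * v for v in values),
            minimum=min(values, default=math.inf),
            maximum=max(values, default=-math.inf),
        )

    def merge(self, other):
        """Combine two partial results; the operation is associative and commutative."""
        return NumericStats(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    def mean(self):
        # Division happens only after aggregation.
        return self.total / self.count

    def stddev(self):
        variance = self.total_sq / self.count - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))
```

Each worker can compute `NumericStats.of(...)` over its own stream, and the results can be merged in any order before `mean()` or `stddev()` is called.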

More sophisticated statistics on scalar values, including histograms of the distribution, can be maintained as needed. In scenarios where some amount of error is acceptable, approximation algorithms that are cheaper to compute can be used. For example, the HyperLogLog algorithm can be used to compute the approximate cardinality (number of distinct values) of a column, and the Q-Digest algorithm can be used to compute approximate quantiles. Statistics on values can be computed in the same way as statistics on properties. During type inference, each value is analyzed to determine its type, and statistics can be compiled at the same time. Local statistics can be kept in memory, merged with global statistics, and stored with the schema in the metadata store. Statistics that maintain a large amount of state may optionally be stored in a different store, such as the index store or a dedicated statistics store, to improve access performance.

Statistics on entire records

Some statistics are based on multiple attributes rather than on a single attribute or value. One common example is identifying which columns frequently occur together. For example, a source based on log data may combine records of different types in the same stream. The following JSON records are one example:
{"user": "mashah", "err_num": 4, "err_msg": "Connection error."}
{"user": "lstavrachi", "session_time": 23, "OS": "Mac OS X"}
{"user": "mashah", "session_time": 17, "OS": "Windows 7"}
{"operation": "update", "duration": 10, "frequency": 3600}

The first record represents a user error, the next two represent user sessions, and the fourth represents a system operation. Determining a cumulative schema from these records and converting it to a relational schema would combine all of these records into the same relational table, but because the records describe disparate events, they can be logically separated.

One way to analyze the differences between groups of records is to store an adjacency matrix, in which each entry represents a pair of columns and contains the number of times the two columns appear in the same record. Because the order of the columns does not matter, the adjacency matrix may be an upper or lower triangular matrix. In the example above, the entry in the user row and session_time column (equivalently, the user column and session_time row) would contain 2, since those attributes both appear in two records, while the entry in the user row and operation column would contain 0, since those attributes never appear together in the data set.

The adjacency matrix may be stored in a variety of formats, such as an adjacency list, and can be updated while a parsed record is in memory. Multiple matrices can be merged by summing corresponding entries, so they can be computed in parallel. The matrix grows as the square of the number of columns, but if it is sparse it can be stored in a compressed format that uses much less space. As with other statistics, the matrix can be stored in the metadata store or the index store.
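A sparse, upper-triangular form of this matrix can be sketched with a counter keyed by sorted attribute pairs. The function name `cooccurrence` and the pair-keyed representation are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Count, for each unordered pair of attributes, the records containing both.

    Keys are (a, b) tuples with a < b, so only the upper triangle is stored,
    and pairs that never co-occur take no space at all.
    """
    matrix = Counter()
    for record in records:
        for a, b in combinations(sorted(record), 2):
            matrix[(a, b)] += 1
    return matrix
```

Two partial matrices computed on different streams can be merged with `+`, which sums corresponding entries as described above.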

Once the adjacency matrix has been computed, it can be used to identify related columns in several ways. The matrix corresponds to a weighted graph in which the nodes are attributes and an edge has weight i if the columns corresponding to its endpoints appear together i times. A clique in this graph represents a set of columns in which every column appears with every other column. The example above contains the following cliques:
("user", "err_num", "err_msg")
("user", "session_time", "OS")
("operation", "duration", "frequency")

Helpfully, these correspond to three distinct event types, which can be presented to the user or used for automatic data transformation. These cliques can be computed using standard graph algorithms, and the algorithm can be configured to require that each edge of a clique have at least a minimum weight.
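As one possible "standard graph algorithm", the sketch below uses the basic Bron-Kerbosch procedure to enumerate maximal cliques; the text does not name a specific algorithm, so this choice is an assumption. The helper names and the `min_weight` parameter are likewise illustrative.

```python
from itertools import combinations


def build_graph(records, min_weight=1):
    """Build an attribute graph, keeping only edges seen at least `min_weight` times."""
    weights = {}
    for record in records:
        for a, b in combinations(sorted(record), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    graph = {}
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return cliques
```

Raising `min_weight` drops rarely co-occurring pairs before clique detection, which corresponds to requiring a minimum weight on every clique edge.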

To capture a broader view of related columns, all connected components in the graph above can be grouped. In other words, any two attributes with a path between them can be combined into a single group. The example above would produce the following two groups:
("user", "err_num", "err_msg", "session_time", "OS")
("operation", "duration", "frequency")

Note that err_num and OS appear in the same connected component, because both error records and session records have a user field, but not in the same clique. This looser notion of related columns may be useful for coarsely separating large sets of unrelated data.
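Connected components can be computed directly from the records with a union-find structure, as in the sketch below. The function name `connected_groups` and the use of union-find (rather than a graph traversal) are assumptions made for the example.

```python
def connected_groups(records):
    """Group attributes that are linked by any chain of co-occurrence."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for record in records:
        keys = list(record)
        if keys:
            find(keys[0])  # register attributes of single-key records too
        for other in keys[1:]:
            union(keys[0], other)

    groups = {}
    for attr in parent:
        groups.setdefault(find(attr), set()).add(attr)
    return list(groups.values())
```

All attributes of one record are linked to the record's first attribute, which is enough to connect every pair within that record.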

The adjacency matrix described above is based only on the schema of each record, but it may be desirable to correlate certain values with the presence or absence of columns. One situation in which unrelated records have different schemas is when each record includes an explicit attribute (e.g., event_id) that identifies the type of the record. In the example above, user errors might have an event_id of 1, user sessions an event_id of 2, and system operations an event_id of 3. If the significance of event_id can be determined (or is indicated by the user), event_id can be used to separate attributes by event type. In this case, a separate cumulative schema can be maintained for each event_id value by merging the schemas of all records with that event_id value. The process of splitting a data source by event_id is called "shredding" and is described below.
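Maintaining one cumulative schema per event_id value might look like the following sketch. It handles flat records only and represents a schema as a mapping from attribute name to the set of Python type names seen, both of which are simplifying assumptions; the function name `shred_schemas` is hypothetical.

```python
def shred_schemas(records, type_attr="event_id"):
    """Merge record schemas separately for each value of `type_attr`."""
    schemas = {}
    for record in records:
        event = record.get(type_attr)
        schema = schemas.setdefault(event, {})
        for key, value in record.items():
            if key != type_attr:
                # Union the observed types for this attribute.
                schema.setdefault(key, set()).add(type(value).__name__)
    return schemas
```

Records that lack the type attribute end up under the key `None`, which keeps them apart from the typed events.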

Asynchronous statistics using the index store

While the statistics above can generally be computed in parallel, some statistics (e.g., the median and the mode) are difficult to compute until all of the data is in place. These statistics may be computed after the data has been loaded, using the query functionality of the index store. Because this may require scanning large amounts of data, it may be performed asynchronously, when the index store is otherwise idle.

Error statistics

Another class of statistics that may be useful to record is error statistics. Many different types of errors may be encountered during ingestion, either in the data itself or in the operation of the system. For example, these may include errors decompressing input data, errors in the format of an input file, parsing errors for specific records, errors in string encoding (e.g., UTF-8), errors trying to open locked files, and errors accessing source data over the network. Information about these statistics can be maintained in the metadata store and sent to users and/or administrators as needed.

Intermediate storage in the index store

Ingested data can be stored in a collection of indexes that facilitate storage and querying, which may include the BigIndex (BI), ArrayIndex (AI), and RowIndex (RI) described above. The index store can be queried directly by users, but it can also be used as an intermediate store during the process of loading another system. This section describes the use of the index store as an intermediate store for the ETL process.

Bulk export

When the index store is used as an intermediate storage area for ETL, efficient export of bulk data from the index store is important. For example, when using hard disk drives (especially magnetic drives), it is more efficient to read large chunks of data sequentially, which avoids seek latency and achieves maximum bandwidth.

For sequentially accessed media such as hard disk drives, a desired fraction of peak performance (f) may be chosen. Given the seek latency of the disk (L) and its sustained sequential bandwidth (B), the number of bytes that must be read per seek (N) to achieve the desired fraction of peak performance can be calculated as follows:
N = (f / (1 - f)) * L * B

For example, consider a disk with a sustained bandwidth of 100 MB/s and an average latency of 10 ms per seek. To achieve 90% of peak performance (f = 0.9):
N = (0.9 / (1 - 0.9)) * 10 ms/seek * 100 MB/s
N = 9 MB/seek

Thus, to achieve the specified fraction of peak performance, the goal is to read data in sequential bursts of at least 9 MB. If the target of the data is a disk-based system, the same formula applies, based on the write performance of the target.
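The formula follows from a simple model, stated here as an assumption: one burst costs a seek of L seconds plus N / B seconds of transfer, so the achieved fraction of peak bandwidth is f = (N / B) / (L + N / B), and solving for N gives N = (f / (1 - f)) * L * B. The sketch below evaluates it; the function name and the use of bytes and seconds as units are choices made for the example.

```python
def bytes_per_seek(f, seek_latency_s, bandwidth_bytes_per_s):
    """Bytes to read per seek to reach fraction `f` of peak sequential bandwidth."""
    if not 0 <= f < 1:
        raise ValueError("f must satisfy 0 <= f < 1")
    return f / (1 - f) * seek_latency_s * bandwidth_bytes_per_s


def achieved_fraction(n_bytes, seek_latency_s, bandwidth_bytes_per_s):
    """Fraction of peak bandwidth achieved when reading `n_bytes` per seek."""
    transfer_s = n_bytes / bandwidth_bytes_per_s
    return transfer_s / (seek_latency_s + transfer_s)
```

With f = 0.9, a 10 ms seek, and 100 MB/s, `bytes_per_seek` reproduces the 9 MB burst size from the example above, up to floating-point rounding.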

Conceptually, data can be exported from the index store either row by row or column by column. The choice of mode depends on the capabilities of the destination data store. When exporting columns, large sorted ranges of data may be exported from the BI, AI, or MI (each of which may be sorted first by column and then by tuple ID). Once the requested tuples are sorted, at least 9 MB of tuples should be read for each column. For efficiency, all tuples from one column may be read and written to the output before moving on to the next column.

When exporting rows, there are at least two options. If the RI is available, the requested data can be sorted by tuple ID and read from the RI in sorted order. Because the RI is also stored in sorted order, this accesses the disk sequentially. The rows then need to be formatted appropriately for the output system.

If the RI is not used for the export, rows may be constructed from, for example, the BI, AI, and MI. For efficiency, large chunks of data are read at a time from each column of the BI, AI, and MI. An output row can be generated once the data for that row has been read from each of the output columns.

With sufficient RAM, an N-megabyte chunk of each column can be read into RAM, and the rows can then be output. If the columns are compressed, they may be kept compressed to the extent possible in order to conserve memory. In addition, memory may be freed in real time as data is written out. An N-megabyte chunk of one column is unlikely to contain the same number of tuples as an N-megabyte chunk of another column (especially when compression is used). Chunks for each column may therefore need to be fetched independently, since the tuples of one column will be exhausted before the tuples of another. Prefetching can be employed to minimize disk latency, although both prefetching and partial decompression may increase the memory requirements of the conversion.

The following is example pseudocode for bulk export of rows from the BI, AI, and MI:
create a cache for each column,
    this cache caches data for a tid with a
    replacement policy geared towards streaming
    and implements prefetching.
for tid in output_tuple_ids:
    start building new tuple
    for each column of this tuple:
        lookup tid in cache
        if tid present
            evict tid to optimize for streaming
            check for free space in this cache,
            if ample free space,
                prefetch the next block
        if tid not present,
            fetch the block of data containing the tid from
            the AI,
        if necessary,
            decompress chunk containing datum
        add datum to tuple
    add tuple to output buffer
    if output buffer is full,
        send output to destination

If the system has limited RAM, the conversion can be performed in multiple passes, where each pass reads as many column chunks as possible and produces partial row output chunks that are stored on disk. The partial row output chunks can then be treated like columns, and the process can be repeated until full rows are output.

For example, FIG. 15 shows an example arrangement, in non-volatile storage 1580 (e.g., on a hard disk drive), of an index containing columns A, B, C, and D. To export a row, data from each of columns A, B, C, and D is required. Reading a single column value from each column in non-volatile storage 1580 may incur a significant penalty in seek and access time. Instead, a chunk of each column can be read into memory. In this simplified example, each column in non-volatile storage 1580 has 1024 entries, and memory 1584 (which may be volatile dynamic random access memory) has room for 512 entries. Therefore, 128 entries for each of columns A, B, C, and D are read from non-volatile storage 1580 into memory 1584.

Once each of the four reads is complete, 128 rows can be exported, each containing an entry from each of the four columns A, B, C, and D. In various implementations, the size of the entries in each column may differ. The storage space in memory 1584 may therefore be divided unequally among the columns, with columns storing larger entries receiving more space, so that approximately the same number of entries can be stored for each column. Alternatively, memory 1584 may allocate storage equally among the columns, in which case columns storing larger entries will have fewer entries held in memory 1584 at a time. As rows are exported, new data from those columns will then need to be loaded into memory 1584 sooner than data from columns with smaller entries.
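The equal-split variant of this example can be sketched as follows. Columns are modeled as in-memory lists standing in for on-disk column data, entries are assumed to be of equal size, and the function name `export_rows` and its parameters are hypothetical.

```python
def export_rows(columns, memory_entries):
    """Yield rows assembled from column chunks that fit within a memory budget.

    `columns` maps a column name to its list of entries (a stand-in for
    on-disk storage). `memory_entries` is the total number of entries
    that may be held in memory at once, split equally among columns.
    """
    names = list(columns)
    if not names:
        return
    chunk = max(1, memory_entries // len(names))
    n_rows = min(len(columns[name]) for name in names)
    for start in range(0, n_rows, chunk):
        stop = min(start + chunk, n_rows)
        # One sequential read per column for this batch of rows.
        buffers = {name: columns[name][start:stop] for name in names}
        for i in range(stop - start):
            yield tuple(buffers[name][i] for name in names)
```

With four columns of 1024 entries and room for 512 entries, each pass reads 128 entries per column and emits 128 rows, matching the figures in the example above.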

Index store management

The index store may be configured to support various management functions, such as snapshotting, recovery, resizing, and migration. These functions can be implemented by appropriate orchestration of index store writes together with underlying system technologies such as the Linux® Logical Volume Manager (LVM).

Snapshots of the index store can be used for backup, to clone an index store, or to aid in recovery. A snapshot can be taken by starting to buffer writes, flushing the index store to disk, using the underlying snapshot facility of the file system or volume manager, and then applying the buffered writes. Alternatively, in systems based on an architecture such as LevelDB, instead of buffering writes, the system can compact the data, mark the lower levels as part of the snapshot so that they are not compacted further, copy the files comprising the lower levels, and then re-enable compaction.

Recovery is accomplished by restoring the backup data according to the underlying system, starting the index storage service, and updating the metadata service to point to the restored data. The set of stored tuple IDs is recorded in the metadata service when the snapshot is taken and is restored when the snapshot is restored. After restoring from a snapshot, missing data can be identified by comparing the set of all tuple IDs with the set of tuple IDs stored in the recovered index store.

Shrinking the index store may be accomplished by compacting all of the data in the index store, reducing the file system size, and then reducing the logical volume size. If there is enough free space to remove a disk (or virtual disk), the data on that disk can be migrated to other free space in the volume group, and the disk can be removed from the volume group and then from the system. Growing the index store may be performed by adding disks or virtual disks as needed. A newly added disk can be included in the volume group, allowing the logical volume size, and in turn the file system size, to be increased. In some implementations of the index store, a file system may not be used, in which case the index store itself may implement the resize operations instead of relying on a file system. In some implementations, the index store may not use LVM and may manage the disks directly itself.

In a cloud environment, or during server maintenance, it may be desirable to migrate an index store from one machine to another. The simplest case is to do this offline. This can be done by blocking reads and writes, shutting down the index storage service, unmounting the file system, deactivating the logical volume, moving the disks (or virtual disks) to the new machine, reactivating the logical volume, mounting the file system, restarting the index storage service, and updating the metadata service so that it knows where the given index store resides.

To migrate an online index store, the system may buffer writes and serve reads from the old index store while a new index store is brought up. The data is then copied from the old index store to the new index store, and the new index store is marked as active. Reads are then directed to the new index store, and finally the buffered writes are applied. To optimize this migration, the new index store can be restored from a snapshot. In such a scenario, writes are buffered, and the delta between the current state of the index store and the snapshot can be computed. The delta can be applied to bring the snapshot on the new system up to date. The new index store is then marked as active, and the buffered writes are applied to the new index store. To minimize buffering time, multiple deltas can be computed and applied before writes are buffered.
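
The basic online migration sequence (buffer writes, copy, mark active, apply buffered writes) may be sketched as follows. This is an illustration only: `DictStore` and `MigratingStore` are hypothetical names, the stores are modeled as in-memory key-value mappings, and the snapshot and delta optimization is not shown.

```python
class DictStore:
    """Hypothetical stand-in for an index store: a key-value mapping."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class MigratingStore:
    """Serves reads from the active store and buffers writes while a
    migration is in progress."""

    def __init__(self, old_store):
        self.active = old_store
        self.buffered_writes = []
        self.migrating = False

    def write(self, key, value):
        if self.migrating:
            self.buffered_writes.append((key, value))
        else:
            self.active.write(key, value)

    def read(self, key):
        return self.active.read(key)

    def begin_migration(self):
        self.migrating = True

    def finish_migration(self, new_store):
        # Copy the data from the old store to the new store.
        for key, value in self.active.data.items():
            new_store.write(key, value)
        # Mark the new store as active; reads are now directed to it.
        self.active = new_store
        # Finally, apply the buffered writes in arrival order.
        for key, value in self.buffered_writes:
            new_store.write(key, value)
        self.buffered_writes.clear()
        self.migrating = False


old = DictStore({"a": 1})
front = MigratingStore(old)
front.begin_migration()
front.write("b", 2)          # buffered; the old store still serves reads
new = DictStore()
front.finish_migration(new)
```

In this sketch a write buffered during the migration is not visible to reads until the migration finishes, which matches the sequence described above.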

Index Store Extensions
Lineage
When moving data from one data store to another, it is desirable to be able to trace the lineage of each datum as it moves through the system. As an example, consider a collection of files containing JSON records, with each record separated by a newline. Data from these files may be loaded into the index store, and from the index store into a data warehouse. To maintain the lineage of the records, each record can be tracked, for example, by recording its source file name and line number. The lineage information can be stored in the index store as an additional column (or set of columns).

These additional columns may also be loaded into the data warehouse. This allows the end user to find the original source of a record. A user may want to do this to investigate whether there is an error, or to find other data that may have been deleted or that was chosen not to be loaded. The system can use the same information to determine whether any data is missing (for example, by sorting by file and row and finding out whether any file or row is absent). Similarly, and likely with similar benefits, when data is loaded from the index store into the data warehouse, an additional column can be created to record the index store tuple ID in the data warehouse. This allows the system to recover from certain errors and allows data to be compared between the index store and the data warehouse.
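
As an illustration of lineage tracking, the sketch below attaches the source file name and line number to each record of a newline-delimited JSON file and then reports line numbers for which no row was loaded. The column names `_source_file` and `_source_line` and both function names are assumptions made for this example.

```python
import json


def load_with_lineage(file_name, lines):
    """Parse newline-delimited JSON and attach lineage columns.
    Lines that do not parse to a JSON object are skipped, which leaves
    a gap in the recorded line numbers."""
    rows = []
    for line_number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        record["_source_file"] = file_name
        record["_source_line"] = line_number
        rows.append(record)
    return rows


def missing_lines(rows, file_name, expected_line_count):
    """Return the line numbers of file_name that have no loaded row."""
    seen = {r["_source_line"] for r in rows if r["_source_file"] == file_name}
    return [n for n in range(1, expected_line_count + 1) if n not in seen]


lines = ['{"id": 1}', 'not json', '{"id": 3}']
rows = load_with_lineage("events.json", lines)
gaps = missing_lines(rows, "events.json", len(lines))
```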

Temporal/Bitemporal Support
Time is a fundamental variable in many systems. Some analyses are improved, or made possible, by a system that understands the meaning of time fields. A database that keeps multiple versions of an object and uses time to distinguish between those versions is often called a temporal database. There are multiple notions of time that apply to an object. Two common notions of time are transaction time (TT) and valid time (VT). Transaction time is the time at which the system performed the transaction. Valid time is the point in time, or the time range, during which a piece of data is valid. For example, a particular address at which a person lived is associated with the specific period (VT) during which the person lived there. The time at which the address was recorded in the system is the TT.

In many environments, it is beneficial to be able to view the history of data. One mechanism for querying historical data is to define a particular point in time as the AS OF time of the query. For example, the system may be asked to answer a question based on all of the information that was available as of (AS OF) January 31, 2014. Such queries may review the objects with the largest TT that is less than or equal to the AS OF time.

In other environments, a user may want to know how the facts about an object changed over time. A person's home address may be recorded in the database in many versions, each with a different valid time. A database user may want to query the person's address AS AT a particular time. Such a query may review the objects whose VT includes the AS AT time.

A database that primarily supports only one notion of time (often transaction time) is considered monotemporal. A database that supports two notions of time, such as both transaction time and valid time, is considered bitemporal. A bitemporal database supports a two-dimensional space of object versions, and supports narrowing down the versions by querying for both a particular TT, using AS OF, and a particular VT, using AS AT.
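
A bitemporal lookup that combines AS OF and AS AT may be sketched as follows. The sketch assumes integer timestamps and a half-open valid time range; the `Version` class and the `query` function are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    tt: int        # transaction time: when the system recorded this version
    vt_start: int  # valid time range is [vt_start, vt_end)
    vt_end: int


def query(versions, key, as_of, as_at):
    """Among the versions of `key` recorded at or before `as_of` whose
    valid time covers `as_at`, return the one with the largest TT."""
    candidates = [
        v for v in versions
        if v.key == key and v.tt <= as_of and v.vt_start <= as_at < v.vt_end
    ]
    return max(candidates, key=lambda v: v.tt, default=None)


versions = [
    # Recorded at TT 10: the person lives at 1 Main St for the whole period.
    Version("alice", "1 Main St", tt=10, vt_start=0, vt_end=100),
    # Recorded at TT 20: the person actually moved at time 50.
    Version("alice", "2 Oak Ave", tt=20, vt_start=50, vt_end=100),
]
before_correction = query(versions, "alice", as_of=15, as_at=60)
after_correction = query(versions, "alice", as_of=25, as_at=60)
```

The same AS AT time yields different answers depending on the AS OF time, because the second version was not yet known to the system at TT 15.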

Spatial indexing methods, such as R-trees, kd-trees, and Z-ordering, can be used to build an index along multiple dimensions so that objects that are close to each other in N-dimensional space are close to each other in the index.

To support temporality in the index store, another index can be created for the time dimension. This index may be a transaction time index (TTI) that maps transaction time to tuple ID. Another index may be a valid time index (VTI) that maps valid time to tuple ID. A bitemporal system may include a spatial index that maps both valid time and transaction time to tuple ID in a bitemporal index (BTI).
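
One way to obtain a single sortable key from two time dimensions is Z-ordering, which interleaves the bits of the two values. The following sketch assumes that transaction time and valid time are non-negative integers below 2**32; the function name and the list-based index are illustrative only.

```python
def interleave_bits(tt, vt, bits=32):
    """Return the Z-order (Morton) key of two non-negative integers.
    Bit i of tt is placed at bit 2*i and bit i of vt at bit 2*i + 1."""
    key = 0
    for i in range(bits):
        key |= ((tt >> i) & 1) << (2 * i)
        key |= ((vt >> i) & 1) << (2 * i + 1)
    return key


# Hypothetical rows of (tuple ID, transaction time, valid time).
rows = [("t1", 5, 7), ("t2", 5, 6), ("t3", 900, 2)]

# A bitemporal index modeled as a list of (key, tuple ID) sorted by key.
bti = sorted((interleave_bits(tt, vt), tuple_id) for tuple_id, tt, vt in rows)
```

Tuples t1 and t2, which are close in both dimensions, receive nearby keys and therefore end up adjacent in the sorted index.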

Updates
The index store described above, in various implementations, handles bulk inserts or bulk appends efficiently. This works well for queries and extraction on systems with hard disk drives. Handling updates and deletes efficiently (an update occurs when an existing tuple is modified in some way) may require some changes to the index store. These updates may arrive as an ordered list of changes made by some kind of transactional store, such as MongoDB and its oplog.

One way to handle updates is to apply them in place to the values of the object. Because single-row updates can be expensive in the index store, updates are buffered in a write-optimized store (for example, a row store). Referring to FIG. 14, a write-optimized store 1552 is shown. A query first looks for values in the write-optimized store and then looks in the index store. Once enough data has accumulated in the write-optimized store, the data can be packaged up and applied to the index store as a bulk update.
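
The buffering behavior may be sketched as follows, assuming that both stores can be modeled as mappings from tuple ID to row. The class name `BufferedIndexStore` and the `flush_threshold` parameter are hypothetical.

```python
class BufferedIndexStore:
    """Buffers row updates in a write-optimized store and applies them to
    the index store in bulk once enough have accumulated."""

    def __init__(self, index_store, flush_threshold=1000):
        self.index_store = index_store  # assumed: dict of tuple ID -> row
        self.write_store = {}           # stand-in for the write-optimized store
        self.flush_threshold = flush_threshold

    def update(self, tuple_id, row):
        self.write_store[tuple_id] = row
        if len(self.write_store) >= self.flush_threshold:
            self.flush()

    def get(self, tuple_id):
        # Look in the write-optimized store first, then in the index store.
        if tuple_id in self.write_store:
            return self.write_store[tuple_id]
        return self.index_store.get(tuple_id)

    def flush(self):
        # One bulk update instead of many single-row updates.
        self.index_store.update(self.write_store)
        self.write_store = {}


store = BufferedIndexStore({1: "old"}, flush_threshold=2)
store.update(1, "new")
value_before_flush = store.get(1)       # served from the write-optimized store
index_before_flush = dict(store.index_store)
store.update(2, "x")                    # reaches the threshold and flushes
```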

Another way to handle updates is to take the records/objects from the input and convert them into different versions of the records/objects based on some key and the transaction time recorded in the transaction register. The converted records/objects can then be appended to the index store as new records. If the destination store is temporal, AS OF queries can be used to query the past.

Data Transformation
When an intermediate store is present, transformations can be performed when exporting data from the intermediate store. This allows different transformations for different destinations, and may support the creation of a single, well-defined representation of the source data before any transformation is applied. When no intermediate store is used, the transformations may be performed when converting each column to relations.

Type Conversion
One common transformation is converting a value from one type to another. For example, because JSON does not define a date type, it is common to store dates as strings (for example, according to ISO 8601) or as numbers (for example, seconds since the UNIX (registered trademark) epoch). If the destination supports a date type (as most relational databases do), a cast directive can be added to convert the values to dates. Similar directives can be used to cast appropriate strings to numbers. These cast directives can be specified manually using prior knowledge of the data, or inferred automatically using statistics collected during schema inference.
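
Cast directives of this kind may be sketched as a mapping from attribute name to conversion function. The sketch below is an assumption about how such directives could be represented; `cast_to_date`, `apply_casts`, and the `CASTS` mapping are hypothetical names.

```python
from datetime import datetime, timezone


def cast_to_date(value):
    """Convert an ISO 8601 string, or a count of seconds since the UNIX
    epoch, to a datetime."""
    if isinstance(value, bool):
        raise TypeError("booleans are not dates")
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        return datetime.fromisoformat(value)
    raise TypeError(f"cannot cast {type(value).__name__} to a date")


# Hypothetical cast directives, keyed by attribute name.
CASTS = {"created": cast_to_date, "count": int}


def apply_casts(record, casts):
    """Return a copy of record with the cast directives applied."""
    return {
        key: casts[key](value) if key in casts else value
        for key, value in record.items()
    }


converted = apply_casts({"created": 0, "count": "42", "name": "x"}, CASTS)
from_string = cast_to_date("2014-01-31T12:00:00+00:00")
```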

Data Cleaning
Various other data cleaning operations can be performed during export. For example, if an attribute is known to belong to a certain domain (for example, strings representing state codes or numbers representing zip codes), values that fall outside that domain can be omitted or converted to a default.

Splitting and Joining Data
As described above, the system may split data from arrays and maps into separate tables, but in some cases users may want additional control over the relational schema. For example, a user may have multiple sources that they want to combine into the same destination, or vice versa. To combine tables, the user may specify a join key, and the join can be performed in the index store before export.

To split data, a set of columns to be placed in a separate table is identified. A globally unique record ID can be used as the join key, and the tables are exported separately. These columns can be specified manually, or by using statistical methods, such as those described above, to identify related sets of attributes.

An operation called shredding partitions data into tables based on the value of a single attribute. This is particularly useful for event-like data sources, where the value of a single attribute determines the type of the record. Statistics can be collected that identify which columns are associated with each ID, and a separate table can be exported for each record type.
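
Shredding may be sketched as grouping records by the value of one attribute while collecting the columns seen for each group. The function name `shred` and the sample events are assumptions made for this example.

```python
from collections import defaultdict


def shred(records, type_attribute):
    """Partition records into tables keyed by the value of type_attribute,
    and collect the columns observed for each record type."""
    tables = defaultdict(list)
    columns = defaultdict(set)
    for record in records:
        record_type = record.get(type_attribute)
        tables[record_type].append(record)
        columns[record_type].update(record.keys())
    return dict(tables), {t: sorted(c) for t, c in columns.items()}


events = [
    {"event": "click", "x": 1, "y": 2},
    {"event": "purchase", "sku": "A1"},
    {"event": "click", "x": 3, "y": 4},
]
tables, columns = shred(events, "event")
```

Each entry of `tables` could then be exported as a separate table whose columns are those listed in `columns`.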

Ingestion Targets
Data Warehouse
One possible output destination (or target) of the ETL process is a data warehouse. The data warehouse may be loaded by creating a set of output files in a row format (for example, CSV), together with a corresponding series of ALTER TABLE/COLUMN commands that adapt the data warehouse to changes in the dynamic schema.

Some data warehouses designed for analytics, including products from Vertica, Greenplum, Aster/Teradata, and Amazon Redshift, use column-oriented storage that stores different columns separately. In these systems, adding a new column does not require modifying existing data and may be more efficient. Files may be loaded into such data warehouses using commands supported by the data warehouse. Feeding destinations that can accept multiple parallel loads is described below.

Non-Relational Object Stores
A user may choose to load an object store instead of a relational store. One way to load data into an object store is to serialize the objects using the cumulative JSON schema according to the desired output format.

Orchestration
The individual components of the ETL process can be scheduled and executed in a distributed computing environment, either in the cloud or within a private data center. Scheduling may depend on the nature of the data sources and destinations, on whether an index store is used, and on the desired degree of parallelism. The state of each stage of the pipeline may be stored in the metadata service to support recovery, statistics, and lineage.

An example ETL process may be segmented into the following components: detecting new data (D), extracting data from the source (S), inferring the schema of the source (I), optionally loading the index store (L), generating the cumulative schema (G), generating alter table statements (A), optionally exporting data in an intermediate format (E), and copying data to the destination (C).

An example of the detection (D) process is described using the pseudocode below.
    old_metadata = None
    while True:
        metadata = source.get_metadata()
        if old_metadata == metadata:
            # nothing new here
            sleep if necessary to prevent overwhelming source
            continue
        # find new stuff
        new_stuff = metadata - old_metadata
        # build chunks of appropriate size
        chunk = new Chunk()
        for item in new_stuff:
            chunk.add_data(item)
            if chunk.big_enough():
                mds.record_chunk(chunk)
                chunk = new Chunk()

An ingestion process that uses an index store may include data extraction (S), schema inference (I), and loading of the index store (L), which together are referred to as SIL. Pseudocode for an example SIL process is shown below, where the index store is referred to as "ISS".
    while True:
        # Get any unprocessed chunk of source data
        chunk = mds.get_any_unprocessed_chunk()
        # code below can be in an asynchronous task
        cumulative = new Schema()
        for record in chunk.read():
            schema = get_schema(record)
            cumulative.update(schema)
            iss.write(record)
        iss.commit_chunk()
        mds.save_chunk_schema(chunk, cumulative)
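
The `get_schema` and `cumulative.update` steps of the SIL pseudocode may be illustrated with the simplified sketch below. It is an assumption made for illustration and is considerably simpler than the schema inference described elsewhere in this disclosure: a schema is modeled as a mapping from attribute path to the set of Python type names seen, nested objects are flattened with dotted paths, and arrays are treated as plain values. The name `update_cumulative` is hypothetical.

```python
def get_schema(record, prefix=""):
    """Map each attribute path of a JSON-like record to the set of type
    names observed at that path."""
    schema = {}
    for key, value in record.items():
        path = prefix + key
        if isinstance(value, dict):
            schema.update(get_schema(value, path + "."))
        else:
            schema[path] = {type(value).__name__}
    return schema


def update_cumulative(cumulative, schema):
    """Merge the schema of one record into the cumulative schema."""
    for path, types in schema.items():
        cumulative.setdefault(path, set()).update(types)
    return cumulative


cumulative = {}
for record in [{"id": 1, "user": {"name": "a"}}, {"id": "x"}]:
    update_cumulative(cumulative, get_schema(record))
```

After both records are processed, the cumulative schema records that `id` has been seen both as an integer and as a string.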

Pseudocode for an example of the schema generation process (G) is as follows.
    parent = None
    while True:
        # get a few committed chunks such that they form a
        # reasonable copy size
        chunks = mds.get_committed_chunks()
        # generate a cumulative schema across loads
        cumulative = Schema()
        for chunk in chunks:
            cumulative.update(mds.get_schema(chunk))
        parent = mds.store_new_cumulative(cumulative, chunks,
                                          parent)

Pseudocode for an example of the export process (E) from the index store is as follows.
    while True:
        cumulative, chunks = mds.get_unexported_schema()
        # code below can be in an asynchronous task
        # export the new chunks according to the
        # cumulative schema
        intermediate = new Intermediate()
        for record in iss.get_chunks(chunks):
            export_data = prepare_output(record, cumulative)
            intermediate.append(export_data)
        mds.record_schema_export(cumulative, intermediate)

Pseudocode for an example of a process that includes the alter table component (A) and the copy component (C), together referred to as AC, is as follows.
    while True:
        schema, output = mds.get_uncopied_output()
        previous = mds.get_parent_schema(schema)
        # generate alter table statements
        if schema != previous:
            difference = schema - previous
            warehouse.alter_table(difference)
        warehouse.copy_from(output)
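
The schema difference computed in the AC pseudocode may be turned into data definition statements along the following lines. The sketch assumes that a schema is a mapping from column name to SQL type, that only additions need to be handled, and that the names are trusted identifiers that need no quoting. The exact statement syntax varies by warehouse, and `alter_table_statements` is a hypothetical helper.

```python
def alter_table_statements(table, previous, schema):
    """Return one ADD COLUMN statement for each column that is present
    in schema but not in previous."""
    new_columns = sorted(schema.keys() - previous.keys())
    return [
        f"ALTER TABLE {table} ADD COLUMN {column} {schema[column]};"
        for column in new_columns
    ]


previous = {"id": "BIGINT"}
schema = {"id": "BIGINT", "name": "VARCHAR(256)", "created": "TIMESTAMP"}
statements = alter_table_statements("events", previous, schema)
```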

When no index store is used and data is streamed from the data source to the target, the overall process may be referred to as SIEGCA, which includes data extraction (S), schema inference (I), data export (E), copy (C), and altering the tables of the target (A). Pseudocode for an example SIEGCA process is as follows.
    cumulative = new Schema()
    intermediate = new Intermediate()
    while True:
        # Get any unprocessed chunk of source data
        chunk = mds.get_any_unprocessed_chunk()
        # code below can be in an asynchronous task
        for record in chunk.read():
            schema = get_schema(record)
            if schema != cumulative:
                mds.record_schema_export(cumulative,
                                         intermediate)
                cumulative.update(schema)
                intermediate = new Intermediate()
            intermediate.append(record)
            iss.write(record)
        mds.save_chunk_schema(chunk, cumulative)

FIGS. 16A and 16B show dependency diagrams for parallelizing the components of the ETL process according to the principles of the present disclosure. FIG. 16A illustrates the use of an intermediate index store. The detection process (D) is shown broken into six subprocesses 1600-1 ... 1600-6. Dependencies are indicated by arrows, where D₂ 1600-2 depends on D₁ 1600-1, and so on. In other words, D₂ cannot complete before D₁ has completed. When the dependency is strict, as is described here for ease of explanation, D₂ cannot even begin until D₁ has completed. This may be because D₂ does not know where to start detecting new data until D₁ has finished and left a starting point for D₂. As just one example, detection subprocess D₁ may be configured to obtain the first 10,000 new objects. Subprocess D₂ then picks up where subprocess D₁ left off and identifies the next 10,000 new objects.

The dependencies shown in FIGS. 16A and 16B may be broken in certain circumstances. For example, if D₁ and D₂ are looking at separate data sources, or at different parts of a data source, D₂ may begin without waiting for D₁ to complete.

The extraction process (S) includes subprocesses 1604-1 ... 1604-6. Each extraction subprocess 1604 depends on the respective detection step 1600. This dependency arises from the fact that the files/objects/records to be extracted from the source are identified during the detection subprocess.

Because schema inference is performed on the records extracted in the extraction subprocess (S), each inference subprocess (I), shown at 1608-1 ... 1608-6, depends on the respective extraction subprocess (S). Each record that is inferred (I) is loaded (L) into intermediate storage, so each of the load subprocesses 1612-1 ... 1612-6 depends on the respective inference subprocess.

The schema generation subprocesses (G), shown at 1616-1 ... 1616-3, take the previous schema and add the newly inferred schemas from one or more inference subprocesses to generate a new cumulative schema. Heuristics may be used to determine how many inference subprocesses feed a single schema generation subprocess. As seen in FIG. 16A, the number may vary.

Export subprocesses 1620-1 ... 1620-3 receive the generated schema from the respective generation subprocess and extract data from the load subprocesses corresponding to that generation subprocess. The export subprocesses 1620 may build a set of intermediate files for loading into a target, such as a data warehouse. Each subprocess, including the export subprocesses, may be optimized to take advantage of internal parallelization.

Based on the generation subprocesses, alter table subprocesses 1624-1 ... 1624-3 generate commands for the target to accommodate any new objects in the schema determined by the schema subprocesses 1616. For example, A₂ determines whether any additional objects were added to the cumulative schema by G₂, compared with the objects that were already present after G₁. The commands for the target may take the form of Data Definition Language (DDL) statements. Although labeled as alter table subprocesses, the actual language or statements may not use the specific statement "alter table". Once the cumulative schema has been reflected on the target by the alter table subprocess 1624, the data is copied to the target in copy subprocesses 1628-1 ... 1628-3.

FIG. 16B shows a streaming approach, in which there is no intermediate store, such as a column or index store. Instead, data is ingested directly into the target, and the load subprocesses 1612 are omitted. As a result, the export subprocesses 1620 depend directly on the schema inference subprocesses 1608. Note that in this implementation, the alter table subprocesses 1624 depend on the copy subprocesses 1628, which is the reverse of the dependencies of FIG. 16A.

Elastic Scheduling of Jobs
When an index store is used, the ingestion tasks (SIL) and the export tasks (E) may be compute-intensive. These tasks can be parallelized internally. In addition, they can be issued asynchronously via a job scheduling system (for example, PBS, LSF, YARN, or MESOS) once their dependencies are met. Many of these job scheduling systems can also encode dependencies between steps. By knowing the dependency graph (such as the one shown in FIG. 16A), a scheduling module can issue the older jobs in the graph in a way that ensures that all nodes are busy but, at the same time, that there are not too many outstanding jobs.
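
The scheduling policy may be sketched as a loop that issues the oldest ready jobs first and caps the number of outstanding jobs. This is a synchronous simulation written for illustration: each batch is assumed to complete before the next is issued, whereas a real job scheduling system would issue and complete jobs asynchronously. The function name `schedule` and the job labels are assumptions.

```python
def schedule(dependencies, max_outstanding):
    """dependencies maps each job to the set of jobs it depends on, with
    jobs listed oldest first. Returns the batches in the order issued."""
    done = set()
    batches = []
    pending = list(dependencies)  # dict order is treated as job age
    while pending:
        ready = [job for job in pending if dependencies[job] <= done]
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        batch = ready[:max_outstanding]  # cap the outstanding jobs
        batches.append(batch)
        done.update(batch)  # simulation: the batch completes immediately
        pending = [job for job in pending if job not in done]
    return batches


# A small fragment of a dependency graph in the style of FIG. 16A.
graph = {
    "D1": set(),
    "S1": {"D1"},
    "D2": {"D1"},
    "I1": {"S1"},
    "S2": {"D2"},
}
batches = schedule(graph, max_outstanding=2)
```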

Error Recovery
For the purposes of error recovery, some or all of the subprocesses record metadata about the state of the system and about the changes that the subprocess intends to make to the system. If a step fails, recovery code can use this metadata either to finish the interrupted operation or to back out the incomplete operation so that it can be retried. For example, the load subprocess records the set of tuple IDs that are being loaded. If the load subprocess fails, a command may be issued to the index store instructing it to purge all records whose tuple IDs match the set. The load can then be retried.
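
The purge-and-retry behavior of the load subprocess may be sketched as follows. The sketch assumes that the metadata service can be modeled as a dictionary and that writing and purging are supplied as callables; `load_chunk`, `flaky_write`, and `purge` are hypothetical names, and the failure is simulated.

```python
def load_chunk(write, purge, metadata, chunk_id, records, max_attempts=3):
    """Load (tuple ID, record) pairs, purging and retrying on failure."""
    tuple_ids = [tuple_id for tuple_id, _ in records]
    # Record the intended change before touching the index store.
    metadata[chunk_id] = {"state": "loading", "tuple_ids": tuple_ids}
    for _ in range(max_attempts):
        try:
            for tuple_id, record in records:
                write(tuple_id, record)
        except OSError:
            # Back out the incomplete load so that it can be retried.
            purge(metadata[chunk_id]["tuple_ids"])
            continue
        metadata[chunk_id]["state"] = "committed"
        return True
    metadata[chunk_id]["state"] = "failed"
    return False


store = {}
metadata = {}
failures_left = [1]  # the write of tuple 2 fails once


def flaky_write(tuple_id, record):
    if tuple_id == 2 and failures_left[0] > 0:
        failures_left[0] -= 1
        raise OSError("simulated write failure")
    store[tuple_id] = record


def purge(tuple_ids):
    for tuple_id in tuple_ids:
        store.pop(tuple_id, None)


ok = load_chunk(flaky_write, purge, metadata, "chunk-1", [(1, "a"), (2, "b")])
```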

Optimizing Export
Some relational stores have a mechanism for loading data directly without an intermediate file. For example, Postgres supports the COPY FROM STDIN command, with which data can be fed directly to the database. The export process can use this interface to write data directly to the output system, thereby merging the export (E) and copy (C) steps. Some systems, such as Amazon's Redshift, have a mechanism by which the warehouse pulls data directly from the index store via a remote procedure call. In this case, the user creates a manifest file that lists a set of secure shell (ssh) commands to issue for the copy. Each ssh command specifies a host and a command to run on that host. By specifying a command that extracts a set of tuple IDs from the index store, the destination database can pull the records/objects needed for the export out of the index store.

Monitoring
Resource Monitoring
A monitoring system tracks the hardware and software resources used by the system, which may include the compute and storage resources used by the collectors, the ingestion pipeline, the metadata store, the (optional) index store, and the data warehouse. The monitoring system tracks metrics that include, but are not limited to, CPU, memory, and disk utilization, as well as responsiveness to network requests.

If the monitoring service is deployed in a cloud (whether public or private) or in another system with programmatically allocated resources, the monitoring system can be used to scale the system up or down automatically as needed. For example, if the monitoring service detects that the storage space of the index store (which may take the form of virtual hard disks using a service such as Amazon EBS) is insufficient, it can trigger a request to provision additional storage automatically, without user intervention. Similarly, if the worker machines used by the ingestion pipeline routinely have low CPU utilization, the monitoring service can shut those machines down.
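The scale-up and scale-down decisions described above can be sketched as a simple policy function. The thresholds, the `Metrics` structure, and the action names are illustrative assumptions. An actual deployment would call the provisioning interface of its cloud provider instead of returning tuples.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    index_store_disk_used: float  # fraction of provisioned storage in use
    worker_cpu: dict              # worker name -> recent CPU samples (0..1)


def plan_scaling(m, disk_high=0.85, cpu_low=0.10):
    """Return a list of (action, target) pairs; thresholds are illustrative."""
    actions = []
    if m.index_store_disk_used >= disk_high:
        actions.append(("provision_storage", "index_store"))
    for worker, samples in sorted(m.worker_cpu.items()):
        # "Routinely" low: every recent sample is below the threshold.
        if samples and max(samples) < cpu_low:
            actions.append(("shut_down", worker))
    return actions


plan = plan_scaling(Metrics(0.91, {"w1": [0.02, 0.05], "w2": [0.40, 0.03]}))
```

Here worker `w2` is kept because one of its samples shows meaningful load, while `w1` is a candidate for shutdown.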

The resource monitoring functionality may rely on a monitoring framework such as Nagios.

Ingestion Monitoring
In addition to basic resource monitoring, additional monitoring can be performed specifically on the ingestion pipeline. Metadata about each stage of the ingestion process (schema inference, loading the index store, computing the schema, and so on) can be stored in a central metadata store during ingestion and can be used to interrogate metrics about the system. At the most basic level, this mechanism can be used to identify processes that have stopped or are not functioning correctly.
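One way to use per-stage metadata to identify stopped processes is sketched below. The record layout (`stage`, `last_heartbeat`, `finished_at`) and the timeout value are assumptions made for illustration. The disclosure does not specify the contents of the metadata store at this level of detail.

```python
def find_stalled(stage_records, now, timeout=300.0):
    """Return the names of unfinished stages silent for more than `timeout` seconds."""
    stalled = []
    for rec in stage_records:
        running = rec["finished_at"] is None
        if running and now - rec["last_heartbeat"] > timeout:
            stalled.append(rec["stage"])
    return stalled


# Timestamps are seconds on an arbitrary clock.
records = [
    {"stage": "schema_inference", "last_heartbeat": 1000.0, "finished_at": 1000.0},
    {"stage": "index_store_load", "last_heartbeat": 1010.0, "finished_at": None},
    {"stage": "schema_computation", "last_heartbeat": 1390.0, "finished_at": None},
]
stalled = find_stalled(records, now=1400.0)
```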

The monitoring system can also track which stages are automatically restarted and how long each stage takes. This may help identify service problems. In addition, historical data from the monitoring system may indicate anomalies in user data. For example, if the time taken to load a fixed amount of source data changes dramatically, that data can be flagged so that the user can investigate which of its characteristics may have changed.
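The load-time check described above might be sketched as a comparison of load time per byte against a historical baseline. The use of the median as the baseline and the factor of 3 are arbitrary choices made for this example, not values taken from the disclosure.

```python
from statistics import median


def flag_load_anomaly(history, new_seconds, new_bytes, factor=3.0):
    """Flag a load whose seconds-per-byte differs from the historical
    median by more than `factor` in either direction.

    `history` is a list of (seconds, bytes) pairs from earlier loads.
    """
    rates = [secs / size for secs, size in history]
    baseline = median(rates)
    rate = new_seconds / new_bytes
    return rate > baseline * factor or rate < baseline / factor


history = [(10.0, 1_000_000), (11.0, 1_000_000), (9.0, 1_000_000)]
slow = flag_load_anomaly(history, 45.0, 1_000_000)
normal = flag_load_anomaly(history, 12.0, 1_000_000)
```

Normalizing by data size corresponds to the "fixed amount of source data" condition: only the time per unit of data is compared.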

Query Monitoring
The monitoring system can monitor the execution times of queries run against the index store or the data warehouse. By observing how the execution times of similar queries change, potential problems with the system, as well as changes in the characteristics of the source data, can be identified. This information can inform the use of indexes in the data warehouse or query planning in the index store. For example, columns that are rarely queried may not need dedicated indexes, while columns that are often sorted in a particular way at query time may benefit from a new index sorted according to the frequent queries.
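The index advice described above can be sketched from a query log as follows. The log format, the `rare` and `frequent` thresholds, and the column names are hypothetical. A real system would derive this information from the statistics of its own query engine.

```python
from collections import Counter


def suggest_indexes(query_log, indexed, rare=0.05, frequent=0.5):
    """Suggest dedicated indexes to drop and sorted indexes to add.

    query_log: list of {"columns": [...], "order_by": (column, direction) or None}
    indexed:   set of columns that currently have a dedicated index
    """
    total = len(query_log) or 1  # avoid division by zero on an empty log
    used = Counter(c for q in query_log for c in set(q["columns"]))
    sorts = Counter(q["order_by"] for q in query_log if q["order_by"])
    # Rarely queried columns may not need a dedicated index.
    drop = sorted(c for c in indexed if used[c] / total < rare)
    # Frequently used sort orders may benefit from a matching sorted index.
    add = sorted(key for key, n in sorts.items() if n / total >= frequent)
    return {"drop": drop, "add_sorted": add}


log = (
    [{"columns": ["user_id", "ts"], "order_by": ("ts", "desc")}] * 8
    + [{"columns": ["user_id"], "order_by": None}] * 2
)
advice = suggest_indexes(log, indexed={"user_id", "country"})
```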

Conclusion
The foregoing description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. The broad teachings of the disclosure can be implemented in a variety of forms. Therefore, while this disclosure includes particular examples, the true scope of the disclosure should not be so limited, since other modifications will become apparent upon a study of the drawings, the specification, and the claims. As used herein, the phrase at least one of A, B, and C should be construed to mean a logical (A or B or C), using a non-exclusive logical OR. It should be understood that one or more steps within a method may be executed in a different order (or concurrently) without altering the principles of the present disclosure.

In this application, including the definitions below, the term module may be replaced with the term circuit. The term module may refer to, be part of, or include an application-specific integrated circuit (ASIC); a digital, analog, or mixed analog/digital discrete circuit; a digital, analog, or mixed analog/digital integrated circuit; a combinational logic circuit; a field-programmable gate array (FPGA); a processor (shared, dedicated, or group) that executes code; memory (shared, dedicated, or group) that stores code executed by a processor; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip.

The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, and/or objects. The term shared processor encompasses a single processor that executes some or all code from multiple modules. The term group processor encompasses a processor that, in combination with additional processors, executes some or all code from one or more modules. The term shared memory encompasses a single memory that stores some or all code from multiple modules. The term group memory encompasses a memory that, in combination with additional memories, stores some or all code from one or more modules. The term memory may be a subset of the term computer-readable medium. The term computer-readable medium does not encompass transitory electrical and electromagnetic signals propagating through a medium, and may therefore be considered tangible and non-transitory. Examples of a non-transitory tangible computer-readable medium include, but are not limited to, non-volatile memory, volatile memory, magnetic storage, and optical storage.

The apparatuses and methods described in this application may be partially or fully implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on at least one non-transitory tangible computer-readable medium. The computer programs may also include and/or rely on stored data.

Claims

1. A data conversion system comprising:
a schema inference module configured to dynamically create a cumulative schema for objects retrieved from a first data source, wherein:
each of the retrieved objects includes (i) data and (ii) metadata describing the data,
dynamically creating the cumulative schema includes, for each object of the retrieved objects, (i) inferring a schema from the object and (ii) selectively updating the cumulative schema to describe the object according to the inferred schema, and
the schema inference module is configured to:
collect statistics on data types of the retrieved objects; and
determine, based on the statistics on the data types, whether the data of the retrieved objects is typed correctly; and
an export module configured to output the data of the retrieved objects to a data destination system according to the cumulative schema.

2. The data conversion system of claim 1, wherein the data destination system comprises a data warehouse.

3. The data conversion system of claim 2, wherein the data warehouse stores relational data.

4. The data conversion system of claim 3, wherein the export module is configured to convert the cumulative schema into a relational schema and to output the data of the retrieved objects to the data warehouse according to the relational schema.

5. The data conversion system of claim 4, wherein:
the export module is configured to perform at least one of:
generating commands for the data warehouse that update a schema of the data warehouse to reflect changes made to the relational schema; and
creating at least one intermediate file from the data of the retrieved objects according to the relational schema,
the at least one intermediate file has a predefined data warehouse format, and
the export module is configured to bulk load the at least one intermediate file into the data warehouse.

6. The data conversion system of any one of claims 1 to 5, further comprising an index store configured to store the data from the retrieved objects in columnar form.

7. The data conversion system of claim 6, wherein the export module is configured to generate row-based data from the data stored in the index store.

8. The data conversion system of claim 6 or 7, wherein:
the schema inference module is configured to create a time index in the index store that maps time values to identifiers of the retrieved objects, and
for each object of the retrieved objects, the time value represents at least one of (i) a transaction time corresponding to creation of the retrieved object and (ii) a valid time corresponding to the retrieved object.

9. The data conversion system of any one of claims 6 to 8, further comprising a write-optimized store configured to (i) cache additional objects for later storage in the index store and (ii) in response to a size of the cache reaching a threshold, package the additional objects together for bulk loading into the index store.

10. The data conversion system of any one of claims 1 to 9, wherein:
the schema inference module is configured to collect statistics on at least one of the metadata of the retrieved objects and the data of the retrieved objects, and
the statistics include at least one of minimum, maximum, average, and standard deviation.

11. The data conversion system of any one of claims 1 to 10, further comprising a data collector module configured to receive relational data from the first data source and to generate the objects for use by the schema inference module,
wherein the data collector module is configured to eventize the relational data by creating (i) a first column indicating the table from which each item of the relational data was retrieved and (ii) a second column indicating a timestamp associated with each item of the relational data.

12. The data conversion system of any one of claims 1 to 11, further comprising a scheduling module configured to assign processing jobs to the schema inference module and the export module according to predetermined dependency information.

13. The data conversion system of any one of claims 1 to 12, wherein:
the export module is configured to partition the cumulative schema into a plurality of tables, each table of the plurality of tables including columns that appear together in the retrieved objects, and
the export module is configured to partition the cumulative schema according to columns found in corresponding groups of the retrieved objects, the groups each having a different value for an identifier element.

14. The data conversion system of claim 1 or 2, wherein:
the schema inference module records a source identifier for each of the retrieved objects, and
for each object of the retrieved objects, the source identifier includes a unique identifier of the first data source and a location of the object within the first data source.

15. A method of operating a data analysis system, the method comprising:
retrieving objects from a data source, wherein each object of the retrieved objects includes (i) data and (ii) metadata describing the data;
dynamically creating a cumulative schema by, for each object of the retrieved objects:
(i) inferring a schema from the object based on the metadata of the object and inferred data types of elements of the data of the object, wherein the inferred schema for at least one of the objects differs from the inferred schema for another one of the objects;
(ii) creating a unified schema that describes both the object described by the inferred schema and the cumulative set of objects described by the cumulative schema; and
(iii) storing the unified schema as the cumulative schema;
collecting statistics on data types of the retrieved objects;
determining, based on the statistics on the data types, whether the data of the retrieved objects is typed correctly; and
exporting the data of each of the retrieved objects to a data warehouse.