WO2025238518A1

WO2025238518A1 - Accelerator processing system

Info

Publication number: WO2025238518A1
Application number: PCT/IB2025/054959
Authority: WO
Inventors: Oded Trainin; Yoav MARKUS; Noam Lieberman; Ilia KONIKOV; Elad BAR HANIN; Yishay BEN TOLILA; Meir NADAM
Original assignee: Neuroblade Ltd
Current assignee: Neuroblade Ltd
Priority date: 2024-05-12
Filing date: 2025-05-12
Publication date: 2025-11-20
Anticipated expiration: 2026-11-12

Abstract

A hardware based, programmable data analytics processor configured to reside between a data storage unit and one or more hosts. The programmable data analytics processor includes a decompression module, a decoding module, a selector module, a dictionary module, a fitter module, a filter and project module, a join and group module, and a communications fabric configured to transfer data between any of the modules.

Description

Accelerator Processing System

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of priority of United States Provisional Patent Application No. 63/645,904, filed on May 12, 2024. The foregoing application is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure generally relates to data processing, and in particular, it concerns streaming data processing.

BACKGROUND

Many modern applications are limited by data communication between storage and processing. Current solutions include adding levels of data cache and re-layout of hardware components. For example, current solutions for data analytics applications have limitations including: (1) Network bandwidth (BW) between storage and processing, (2) network bandwidth between CPUs, (3) memory size of CPUs, (4) inefficient data processing methods, and (5) access rate to CPU memory.

In addition, data analytics solutions have significant challenges in scaling up. For example, when trying to add more processing power or memory, more processing nodes are required, therefore more network bandwidth between processors and between processors and storage is required, leading to network congestion. SUMMARY

A system, including a hardware based, programmable data analytics processor configured to reside between a data storage unit and one or more host processors, wherein the programmable data analytics processor includes: a decompression module configured to input a first set of data, the first set of data being compressed data, restore the compressed data to the compressed data’s uncompressed data form, and output the uncompressed data; a decoding module configured to input the uncompressed data, apply one or more decoding functions to at least a portion of the uncompressed data to generate decoded data as a first set of data; a selector module configured to input the decoded data, the decoded data being based on the first set of data and, based on a selection indicator output a first subset of the first set of data; a dictionary module configured to input the first subset of data, replace at least a portion of the first subset of data with corresponding data, and output lookup data; a fitter module configured to input the lookup data, change the size of at least a portion of the lookup data to desired sizes, and output fitted data; a filter and project module configured to input a second set of data and, based on a function, output an updated second set of data; a join and group module configured to combine data from one or more third data sets into a combined data set; and a communications fabric configured to transfer data between any of the decompression module, the decoding module, selector module, the dictionary module, the fitter module, the filter and project module, and the join and group module.

A system includes a hardware based, programmable data analytics processor configured to reside between a data storage unit and one or more host processors, wherein the programmable data analytics processor includes: a selector module configured to input data based on a first set of data and, based on a selection indicator output a first subset of the first set of data, a filter and project module configured to input a second set of data and, based on a function, output an updated second set of data, and a join and group module configured to combine data from one or more third data sets into a combined data set, and one or more modules selected from the group consisting of: a decompression module configured to input the first set of data, the first set of data being compressed data, restore the compressed data to the compressed data’ s uncompressed data form, and output the uncompressed data, a decoding module configured to input the uncompressed data, apply one or more decoding functions to at least a portion of the uncompressed data to generate decoded data, and a dictionary module configured to input the first subset of data, replace at least a portion of the first subset of data with corresponding data, and output lookup data, and a fitter module configured to input the lookup data, change the size of at least a portion of the lookup data to desired sizes, and output fitted data, and a communications fabric configured to transfer data between any of the modules. BRIEF DESCRIPTION OF THE DRAWINGS

The embodiment is herein described, by way of example only, with reference to the accompanying drawings.

FIG. 1 is an example of a computer (CPU) architecture.

FIG. 2 is an example of a graphics processing unit (GPU) architecture.

FIG. 3 is a diagrammatic representation of a computer memory with an error correction code (ECC) capability.

FIG. 4 is a diagrammatic representation of a process for writing data to a memory module.

FIG. 5 is a diagrammatic representation of a process for reading from memory.

FIG. 6 is a diagrammatic representation of an architecture including memory processing modules.

FIG. 7 shows a host provide instructions, data, and/or other input to a memory appliance and read output from the same.

FIG. 8 is an example of implementations of processing systems and, in particular, for data analytics.

FIG. 9 is an example of a high-level architecture for a data analytics accelerator.

FIG. 10 is an example of a software layer for a data analytics accelerator.

FIG. 11 is an example of the hardware layer for a data analytics accelerator.

FIG. 12 is an example of the storage layer and bridges for a data analytics accelerator.

FIG. 13 is an example of networking for a data analytics accelerator.

DETAILED DESCRIPTION

Example Architecture

FIG. 1 is an example of a computer (CPU) architecture. A CPU 100 may include a processing unit 110 that includes one or more processor subunits, such as processor subunit 120a and processor subunit 120b. Although not depicted in the current figure, each processor subunit may include a plurality of processing elements. Moreover, the processing unit 110 may include one or more levels of on-chip cache. Such cache elements are generally formed on the same semiconductor die as processing unit 110 rather than being connected to processor subunits 120a and 120b via one or more buses formed in the substrate containing processor subunits 120a and 120b and the cache elements. An arrangement directly on the same die, rather than being connected via buses, may be used for both first-level (LI) and second-level (L2) caches in processors. Alternatively, in older processors, L2 caches were shared amongst processor subunits using back-side buses between the subunits and the L2 caches. Back-side buses are generally larger than front-side buses, described below. Accordingly, because cache is to be shared with all processor subunits on the die, cache 130 may be formed on the same die as processor subunits 120a and 120b or communicatively coupled to processor subunits 120a and 120b via one or more back-side buses. In both embodiments without buses (e.g., cache is formed directly on-die) as well as embodiments using back-side buses, the caches are shared between processor subunits of the CPU.

Moreover, processing unit 110 may communicate with shared memory 140a and memory 140b. For example, memories 140a and 140b may represent memory banks of shared dynamic random-access memory (DRAM). Although depicted with two banks, memory chips may include between eight and sixteen memory banks. Accordingly, processor subunits 120a and 120b may use shared memories 140a and 140b to store data that is then operated upon by processor subunits 120a and 120b. This arrangement, however, results in the buses between memories 140a and 140b and processing unit 110 acting as a bottleneck when the clock speeds of processing unit 110 exceed data transfer speeds of the buses. This is generally true for processors, resulting in lower effective processing speeds than the stated processing speeds based on clock rate and number of transistors.

FIG. 2 is an example of a graphics processing unit (GPU) architecture. Deficiencies of the CPU architecture similarly persist in GPUs. A GPU 200 may include a processing unit 210 that includes one or more processor subunits (e.g., subunits 220a, 220b, 220c, 220d, 220e, 220f, 220g, 220h, 220i, 220j, 220k, 2201, 220m, 220n, 220o, and 220p). Moreover, the processing unit 210 may include one or more levels of on-chip cache and/or register files. Such cache elements are generally formed on the same semiconductor die as processing unit 210. Indeed, in the example of the current figure, cache 210 is formed on the same die as processing unit 210 and shared amongst all of the processor subunits, while caches 230a, 230b, 230c, and 230d are formed on a subset of the processor subunits, respectively, and dedicated thereto.

Moreover, processing unit 210 communicates with shared memories 250a, 250b, 250c, and 250d. For example, memories 250a, 250b, 250c, and 250d may represent memory banks of shared DRAM. Accordingly, the processor subunits of processing unit 210 may use shared memories 250a, 250b, 250c, and 250d to store data that is then operated upon by the processor subunits. This arrangement, however, results in the buses between memories 250a, 250b, 250c, and 250d and processing unit 210 acting as a bottleneck, similar to the bottleneck described above for CPUs.

FIG. 3 is a diagrammatic representation of a computer memory with an error correction code (ECC) capability. As shown in the current figure, a memory module 301 includes an array of memory chips 300, shown as nine chips (i.e., chip-0, 100-0 through chip-8, 100-8, respectively). Each memory chip has respective memory arrays 302 (e.g., elements labelled 302- 0 through 302-8) and corresponding address selectors 306 (shown as respective selector-0 106-0 through selector-8 106-8). Controller 308 is shown as a DDR controller. The DDR controller 308 is operationally connected to CPU 100 (processing unit 110), receiving data from the CPU 100 for writing to memory, and retrieving data from the memory to send to the CPU 100. The DDR controller 308 also includes an error correction code (ECC) module that generates error correction codes that may be used in identifying and correcting errors in data transmissions between CPU 100 and components of memory module 301.

FIG. 4 is a diagrammatic representation of a process for writing data to the memory module 301. Specifically, the process 420 of writing to the memory module 301 can include writing data 422 in bursts, each burst including 8 bytes for each chip being written to (in the current example, 8 of the memory chips 300, including chip-0, 100-0 to chip-7, 100-7). In some implementations, an original error correction code (ECC) 424 may be calculated in the ECC module 312 in the DDR controller 308. The ECC 424 is calculated across each of the chip’s 8 bytes of data, resulting in an additional, original, 1-byte ECC for each byte of the burst across the 8 chips. The 8-byte (8xl-byte) ECC is written with the burst to a ninth memory chip serving as an ECC chip in the memory module 301, such as chip- 8, 100-8.

The memory module 301 can activate a cyclic redundancy check (CRC) check for each chip’s burst of data, to protect the chip interface. A cyclic redundancy check is an errordetecting code commonly used in digital networks and storage devices to detect accidental changes to raw data. Blocks of data get a short check value attached, based on the remainder of a polynomial division of the block’s contents. In this case, an original CRC 426 is calculated by the DDR controller 308 over the 8 bytes of data 422 in a chip’s burst (one row in the current figure) and sent with each data burst (each row / to a corresponding chip) as a ninth byte in the chip’s burst transmission. When each chip 300 receives data, each chip 300 calculates a new CRC over the data and compares the new CRC to the received original CRC. If the CRCs match, the received data is written to the chip’s memory 302. If the CRCs do not match, the received data is discarded, and an alert signal is activated. An alert signal may include an ALERT_N signal.

Additionally, when writing data to a memory module 301, an original parity 428 A is normally calculated over the (exemplary) transmitted command 428B and address 428C. Each chip 300 receives the command 428B and address 428C, calculates a new parity, and compares the original parity to the new parity. If the parities match, the received command 428B and address 428C are used to write the corresponding data 422 to the memory module 301. If the parities do not match, the received data 422 is discarded, and an alert signal (e.g., ALERT_N) is activated. FIG. 5 is a diagrammatic representation of a process 530 for reading from memory. When reading from the memory module 301, the original ECC 424 is read from the memory and sent with the data 422 to the ECC module 312. The ECC module 312 calculates a new ECC across each of the chips’ 8 bytes of data. The new ECC is compared to the original ECC to determine (detect, correct) if an error has occurred in the data (transmission, storage). In addition, when reading data from memory module 301, an original parity 538A is normally calculated over the (exemplary) transmitted command 538B and address 538C (transmitted to the memory module 301 to tell the memory module 301 to read and from which address to read). Each chip 300 receives the command 538B and address 538C, calculates a new parity, and compares the original parity to the new parity. If the parities match, the received command 538B and address 538C are used to read the corresponding data 422 from the memory module 301. If the parities do not match, the received command 538B and address 538C are discarded and an alert signal (e.g., ALERT_N) is activated.

Overview of Memory Processing Modules and Associated Appliances

FIG. 6 is a diagrammatic representation of an architecture including memory processing modules. For example, a memory processing module (MPM) 610, as described above, may be implemented on a chip to include at least one processing element (e.g., a processor subunit) local to associated memory elements formed on the chip. In some cases, an MPM 610 may include a plurality of processing elements spatially distributed on a common substrate among their associated memory elements within the MPM 610.

In the example of Fig. 6, the memory processing module 610 includes a processing module 612 coupled with four, dedicated memory banks 600 (shown as respective bank-0, 600-0 through bank-3, 600-3). Each bank includes a corresponding memory array 602 (shown as respective memory array-0, 602-0 through memory array-3, 602-3) along with selectors 606 (shown as selector-0 606-0 to selector-3 606-3). The memory arrays 602 may include memory elements similar to those described above relative to memory arrays 302. Local processing, including arithmetic operations, other logic-based operations, etc. can be performed by processing module 612 (also referred to in the context of this document as a “processing subunit,” “processor subunit,” “logic,” “micro mind,” or “UMIND”) using data stored in the memory arrays 602, or provided from other sources, for example, from other of the processing modules 612. In some cases, one or more processing modules 612 of one or more MPMs 610 may include at least one arithmetic logic units (ALU). Processing module 612 is operationally connected to each of the memory banks 600.

A DDR controller 608 may also be operationally connected to each of the memory banks 600, e.g., via an MPM slave controller 623. Alternatively, and/or in addition to the DDR controller 608, a master controller 622 can be operationally connected to each of the memory banks 600, e.g., via the DDR controller 608 and memory controller 623. The DDR controller 608 and the master controller 622 may be implemented in an external element 620.

Additionally, and/or alternatively, a second memory interface 618 may be provided for operational communication with the MPM 610.

While the MPM 610 of Fig. 6 pairs one processing module 612 with four, dedicated memory banks 600, more or fewer memory banks can be paired with a corresponding processing module to provide a memory processing module. For example, in some cases, the processing module 612 of MPM 610 may be paired with a single, dedicated memory bank 600. In other cases, the processing module 612 of MPM 610 may be paired with two or more dedicated memory banks 600, four or more dedicated memory banks 600, etc. Various MPMs 610, including those formed together on a common substrate or chip, may include different numbers of memory banks relative to one another. In some cases, an MPM 610 may include one memory bank 600. In other cases, an MPM may include two, four, eight, sixteen, or more memory banks 600. As a result, the number of memory banks 600 per processing module 612 may be the same throughout an entire MPM 610 or across MPMs. One or more MPMs 610 may be included in a chip. In a non-limiting example, included in an XRAM chip 624. Alternatively, at least one processing module 612 may control more memory banks 600 than another processing module 612 included within an MPM 610 or within an alternative or larger structure, such as the XRAM chip 624.

Each MPM 610 may include one processing module 612 or more than one processing module 610. In the example of Fig. 6, one processing module 612 is associated with four dedicated memory banks 600. In other cases, however, one or more memory banks of an MPM may be associated with two or more processing modules 612.

Each memory bank 600 may be configured with any suitable number of memory arrays 602. In some cases, a bank 600 may include only a single array. In other cases, a bank 600 may include two or more memory arrays 602, four or more memory arrays 602, etc. Each of the banks 600 may have the same number of memory arrays 602. Alternatively, different banks 600 may have different numbers of memory arrays 602.

Various numbers of MPMs 610 may be formed together on a single hardware chip. In some cases, a hardware chip may include just one MPM 610. In other cases, however, a single hardware chip may include two, four, eight, sixteen, 32, 64, etc. MPMs 610. In the particular non-limiting example represented in the current figure, 64 MPMs 610 are combined together on a common substrate of a hardware chip to provide the XRAM chip 624, which may also be referred to as a memory processing chip or a computational memory chip. In some embodiments, each MPM 610 may include a slave controller 613 (e.g., an extreme / Xele or XSC slave controller (SC)) configured to communicate with a DDR controller 608 (e.g., via MPM slave controller 623), and/or a master controller 622. Alternately, fewer than all of the MPMs onboard an XRAM chip 624 may include a slave controller 613. In some cases, multiple MPMs (e.g., 64 MPMs) 610 may share a single slave controller 613 disposed on XRAM chip 624. Slave controller 613 can communicate data, commands, information, etc. to one or more processing modules 612 on XRAM chip 624 to cause various operations to be performed by the one or more processing modules 612.

One or more XRAM chips 624, which may include a plurality of XRAM chips 624, such as sixteen XRAM chips 624, may be configured together to provide a dual in-line memory module (DIMM) 626. Traditional DIMMs may be referred to as a RAM stick, which may include eight or nine, etc., dynamic random-access memory chips (integrated circuits) constructed as/on a printed circuit board (PCB) and having a 64-bit data path. In contrast to traditional memory, the disclosed memory processing modules 610 include at least one computational component (e.g., processing module 612) coupled with local memory elements (e.g., memory banks 600). As multiple MPMs may be included on an XRAM chip 624, each XRAM chip 624 may include a plurality of processing modules 612 spatially distributed among associated memory banks 600. To acknowledge the inclusion of computational capabilities (together with memory) within the XRAM chip 624, each DIMM 626 including one or more XRAM chips (e.g., sixteen XRAM chips, as in the Fig. 6 example) on a single PCB may be referred to as an XDIMM (or eXtremeDIMM or XeleDIMM). Each XDIMM 626 may include any number of XRAM chips 624, and each XDIMM 624 may have the same or a different number of XRAM chips 624 as other XDIMMs 626. In the Fig. 6 example, each XDIMM 626 includes sixteen XRAM chips 624.

As shown in Fig. 6, the architecture may further include one or more memory processing units, such as an intense memory processing unit (IMPU) 628. Each IMPU 628 may include one or more XDIMMs 626. In the Fig. 6 example, each IMPU 628 includes four XDIMMs 626. In other cases, each IMPU 628 may include the same or a different number of XDIMMs as other IMPUs. The one or more XDIMMs included in IMPU 628 can be packaged together with or otherwise integrated with one or more DDR controllers 608 and/or one or more master controllers 622. For example, in some cases, each XDIMM included in IMPU 628 may include a dedicated DDR controller 608 and/or a dedicated master controller 622. In other cases, multiple XDIMMs included in IMPU 628 may share a DDR controller 608 and/or a master controller 622. In one particular example, IMPU 628 includes four XDIMMs 626 along with four master controllers 622 (each master controller 622 including a DDR controller 608), where each of the master controllers 622 is configured to control one associated XDIMM 626, including the MPMs 610 of the XRAM chips 624 included in the associated XDIMM 626.

The DDR controller 608 and the master controller 622 are examples of controllers in a controller domain 630. A higher-level domain 632 may contain one or more additional devices, user applications, host computers, other devices, protocol layer entities, and the like. The controller domain 630 and related features are described in the sections below. In a case where multiple controllers and/or multiple levels of controllers are used, the controller domain 630 may serve as at least a portion of a multi-layered module domain, which is also further described in the sections below.

In the architecture represented by Fig. 6, one or more IMPUs 628 may be used to provide a memory appliance 640, which may be referred to as an XIPHOS appliance. In the example of Fig. 6, memory appliance 640 includes four IMPUs 628.

The location of processing elements 612 among memory banks 600 within the XRAM chips 624 (which are incorporated into XDIMMs 626 that are incorporated into IMPUs 628 that are incorporated into memory appliance 640) may significantly relieve the bottlenecks associated with CPUs, GPUs, and other processors that operate using a shared memory. For example, a processor subunit 612 may be tasked to perform a series of instructions using data stored in memory banks 600. The proximity of the processing subunit 612 to the memory banks 600 can significantly reduce the time required to perform the prescribed instructions using the relevant data.

As shown in FIG. 7, a host 710 may provide instructions, data, and/or other input to memory appliance 640 and read output from the same. Rather than requiring the host to access a shared memory and perform calculations/functions relative to data retrieved from the shared memory, in the disclosed embodiments, the memory appliance 640 can perform the processing associated with a received input from host 710 within the memory appliance (e.g., within processing modules 612 of one or more MPMs 610 of one or more XRAM chips 624 of one or more XDIMMs 626 of one or more IMPUs). Such functionality is made possible by the distribution of processing modules 612 among and on the same hardware chips as the memory banks 600 where relevant data needed to perform various calculations/functions/etc. is stored.

The architecture described in Fig. 6 may be configured for execution of code. For example, each processor subunit 612 may individually execute code (defining a set of instructions) apart from other processor subunits in an XRAM chip 624 within memory appliance 640. Accordingly, rather than relying on an operating system to manage multithreading or using multitasking (which is concurrency rather than parallelism), the XRAM chips of the present disclosure may allow for processor subunits to operate fully in parallel. In addition to a fully parallel implementation, at least some of the instructions assigned to each processor subunit may be overlapping. For example, a plurality of processor subunits 612 on an XRAM chip 624 (or within an XDIMM 626 or IMPU 628) may execute overlapping instructions as, for example, an implementation of an operating system or other management software, while executing non-overlapping instructions in order to perform parallel tasks within the context of the operating system or other management software.

For purposes of various structures discussed in this description, the Joint Electron Device Engineering Council (JEDEC) Standard No. 79-4C defines the DDR4 SDRAM specification, including features, functionalities, AC and DC characteristics, packages, and ball/signal assignments. The latest version at the time of this application is January 2020, available from JEDEC Solid State Technology Association, 3103 North 10th Street, Suite 240 South, Arlington, VA 22201-2107, www.jedec.org, and is incorporated by reference in its entirety herein.

Exemplary elements such as XRAM, XDIMM, XSC, and IMPU are available from NeuroBlade Ltd., Tel Aviv, Israel. Details of memory processing modules and related technologies can be found in PCT/IB2018/000995 filed 30-July-2018, PCT/IB2019/001005 filed 6-September-2019, PCT/IB2020/000665 filed 13-August-2020, and PCT/US2021/055472 filed 18-October-2021. Exemplary implementations using XRAM, XDIMM, XSC, IMPU, etc. elements are not limiting, and based on this description one skilled in the art will be able to design and implement configurations for a variety of applications using alternative elements.

Data Analytics Processor

FIG. 8 is an example of implementations of processing systems and, in particular, processing systems for data analytics. Many modern applications are limited by data communication 820 between storage 800 and processing (shown as general -purpose compute 810). Current solutions include adding levels of data cache and re-layout of hardware components. For example, current solutions for data analytics applications have limitations including: (1) Network bandwidth (BW) between storage and processing, (2) network bandwidth between CPUs, (3) memory size of CPUs, (4) inefficient data processing methods, and (5) access rate to CPU memory.

In addition, data analytics solutions have significant challenges in scaling up. For example, when trying to add more processing power or memory, more processing nodes are required, therefore more network bandwidth between processors and between processors and storage is required, leading to network congestion.

FIG. 9 is an example of a high-level architecture for a data analytics accelerator. A data analytics accelerator 900 is configured between an external data storage 920 and an analytics engine (AE) 910 optionally followed by completion processing 912, for example, on the analytics engine 910. The external data storage 920 may be deployed external to the data analytics accelerator 900, with access via an external computer network. The analytics engine (AE) 910 may be deployed on a general-purpose computer and may include client data storage 911. The accelerator may include a software layer 902, a hardware layer 904, a storage layer 906, and networking (not shown). Each layer may include modules such as software modules 922, hardware modules 924, and storage modules 926. The layers and modules are connected within, between, and external to each of the layers. Acceleration may be done at least in part by applying one or more innovative operations, data reduction, and partial processing operations between the external data storage 920 and the analytics engine 910 (or general-purpose compute 810). Implementations may include, but are not limited to, features such as in-line, high parallelism computation, and data reduction. In an alternative operation, (only) a portion of data is processed by the data analytics accelerator 900 and a portion of the data bypasses the data analytics accelerator 900.

The data analytics accelerator 900 may provide at least in part a streaming processor, and is particularly suited, but not limited to, accelerating data analytics. The data analytics accelerator 900 may drastically reduce (for example, by several orders of magnitude) the amount of data which is transferred over the network to the analytics engine 910 (and/or the general- purpose compute 810), reduces the workload of the CPU, and reduces the required memory which the CPU needs to use. The accelerator 900 may include one or more data analytics processing engines which are tailor-made for data analytics tasks, such as scan, join, filter, aggregate etc., doing these tasks much more efficiently than analytics engine 910 (and/or the general-purpose compute 810).

An implementation of the data analytics accelerator 900 is the Hardware Enhanced Query System (HEQS™), which may include a Xiphos Data Analytics Accelerator (available from NeuroBlade Ltd., Tel Aviv, Israel). The data analytics accelerator 900 may be implemented as a portion of a data analytics acceleration layer (DAXL™), for example as a portion of an SQL Processing Unit (SPU ™). DAXL may include a suite of APIs, a development kit, and software designed to facilitate the use of the SPU across all software layers.

FIG. 10 is an example of the software layer for the data analytics accelerator. The software layer 902 may include, but is not limited to, two main components: a software development kit (SDK) 1000 and embedded software 1010. The SDK provides abstraction of the accelerator capabilities through well-defined and easy to use data-analytics oriented software APIs for the data analytics accelerator. A feature of the SDK is enabling users of the data analytics accelerator to maintain the users’ own DBMS, while adding the data analytics accelerator capabilities, for example, as part of the users’ DBMS’s planner optimization. The SDK may include modules such as:

A run-time environment 1002 may expose hardware capabilities to above layers. The run-time environment may manage the programming, execution, synchronization, and monitoring of underlying hardware engines and processing elements.

A Fast Data I/O providing an efficient API 1004 for injection of data into the data analytics accelerator hardware and storage layers, such as an NVMe array and memories, and for interaction with the data. The Fast Data I/O may also be responsible for forwarding data from the data analytics accelerator to another device (such as the analytics engine 910, an external host, or server) for processing and/or completion processing 912.

A manager 1006 (data analytics accelerator manager) may handle administration of the data analytics accelerator.

A toolchain may include development tools 1008, for example, to help developers enhance the performance of the data analytics accelerator, eliminate bottlenecks, and optimize query execution. The toolchain may include a simulator and profiler, as well as a LLVM compiler.

Embedded software component 1010 may include code running on the data analytics accelerator itself. Embedded software component 1010 may include firmware 1012 that controls the operation of the accelerator’s various components, as well as real-time software 1014 that runs on the processing elements. At least a portion of the embedded software component code may be generated, such as auto generated, by the (data analytics accelerator) SDK.

FIG. 11 is an example of the hardware layer for the data analytics accelerator. The hardware layer 904 includes one or more acceleration units 1100. Each acceleration unit 1100 includes one or more of a variety of elements (modules), such as: a decompression module 1120, a decoding module 1122, a selector module 1102, a dictionary module 1124, a fitter module 1126, a filter and projection module (FPE) 1103, a Join-and-group-by (JaGB) module 1108, and bridges 1110. Each module may contain one or more sub-modules, for example, the FPE 1103 may include a string engine (SE) 1104 and a filtering and aggregation engine (FAE) 1106.

In the current figure, a plurality of acceleration units 1100 are shown as first acceleration unit 1100-1 to Nth acceleration unit 1100-N. In the context of this description, the element number suffix “-N”, where “N” is an integer, generally refers to an exemplary one of the elements, and the element number without a suffix refers to the element in general or the group of elements. One or more acceleration units 1100, individually or in combination, may be implemented using one or more individual or combination of field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), printed circuit boards (PCBs), and similar. Acceleration units 1100 may have the same or similar hardware configurations. However, this is not limiting, and modules may vary from one to another of the acceleration units 1100.

An exemplary element (module) configuration will be used in this description. As noted above, element configuration may vary, and one or more elements may or may not be used. Similarly, an exemplary configuration of networking and communication will be used. However, alternative, and additional connections between elements, feed forward, and feedback data may be used. Communication between elements may be done via one or more of the bridges 1110 and/or via alternative communication buses, channels, and the like. Similarly, communication between one or more of the elements and elements external to the element’s acceleration module may be done via one or more of the bridges 1110. Input and output from elements may include data and alternatively or additionally includes signaling and similar information.

Refer also to FIG. 12, described below in more detail. The decompression module 1120 may be configured to receive input from any of the acceleration elements, such as, for example, from the bridges 1110. Input data (compressed) from local data storage 1208 may be input via storage bridge 1112, or data from accelerator memory 1200 input via memory bridge 1114, and decompressed by the decompression module 1120. Input data may include Parquet files (data in the Parquet file format, a columnar storage format). Decompression may include restoring compressed data to the compressed data’s original, uncompressed form. The decompression module 1120 may be configured to output to any of the acceleration unit elements, for example, to the decoding module 1122.

The decoding module 1122 may be configured to receive input from any of the acceleration elements, such as, for example, from the decompression module 1120. The decoding module 1122 may apply one or more decoding functions to at least a portion of the input data to generate decoded data. For example, commercial databases may have data formats specific to the database, or data may be in files with formats specific to the file types. The decoding module 1122 may be configured to output to any of the acceleration unit elements, for example, to the selector module 1102.

The selector module 1102 may be configured to receive input from any of the acceleration elements, such as, for example, the bridges 1110 and the Join-and-group-by engine (JaGB module) 1108 (shown in the current figure), and optionally/alternatively/in addition from the filtering and projection module (FPE) 1103, the string engine (SE) 1104, and the filtering and aggregation engine (FAE) 1106 (for clarity, not shown in the current figure). Note that in a case of feedback to the selector module 1102, the (data being) feedback may not be the processed data, but may be data other than the processed data, such as selection indicators. The selector module 1102 may be configured to select at least a subset of data from the input as processed data output. The selection may be based on one or more selection criteria, for example, a selection indicator provided by the join-and-group-by module 1108. Similarly, the selector module 1102 may be configured to output to any of the acceleration elements, such as, for example, to the dictionary module 1124.

The dictionary module 1124 may be configured to receive input from any of the acceleration elements, such as, for example, from the selector module 1102 and the bridges 1110. The dictionary module 1124 may replace input data with corresponding data for output. For example, input data may be encoded in a minimal format (i.e., 6 bits for a US state) and a dictionary used to lookup and output a desired string (i.e., full name of the US state). The dictionary module 1124 may be configured to output to any of the acceleration unit elements, for example, to the fitter module 1126 and to the bridges 1110.

The fitter module 1126 may be configured to receive input from any of the acceleration elements, such as, for example, from the dictionary module 1124 and the bridges 1110. The fitter module 1126 may change the size (fit) of input data to output data conforming to a desired size. The fitter module 1126 may also change the format of input data to be a desired format for output and subsequent processing. For example, Boolean data stored on disk might be a single bit, but downstream modules are configured for processing Boolean values as 8-bits of data. The fitter module 1126 may be configured to output to any of the acceleration unit elements, for example, to the FPE 1103.

The filter and project module (FPE 1103) may include a variety of elements (subelements) such as the string engine (SSE 1104) and the filtering and aggregation engine (FAE 1106). Input and output from the FPE 1103 may be to the FPE 1103 for distribution to subelements, or directly to and from one or more of the sub-elements. The FPE 1103 may be configured to receive input from any of the other acceleration elements, such as, for example, from the fitter module 1126 and the bridges 1110. The FPE 1103 may include functions such as filtering input data and performing projections of data. Filtering may be based on values, be of rows, columns, and the like. Projection may include , for example, processing one or more columns and creating one or more new columns based on the processed columns of data. The FPE 1103 input data may be communicated to one or more sub-elements performing one or more functions, for example, to the string engine 1104 and FAE 1106. The string engine 1104 may perform functions such as searches and filtering based on a given set of parameters, for example based on strings of characters, patterns, and/or portions of string patterns. The FAE 1106 may perform functions such as aggregation, for example summing values from a designated column of data. Similarly, the FPE 1103 may be configured to output from any of the sub-elements to any of the acceleration elements, such as, for example, to the JaGB 1108.

The Join-and-group-by (JaGB) engine 1108 may be configured to receive input from any of the acceleration elements, such as, for example, from the FPE 1103 and the bridges 1110. The JaGB module may implement key-value (KV) functions, such as lookups, as well as functions such as joins. The JaGB 1108 may be configured to output to any of the acceleration unit elements, for example, to the selector module 1102 and the bridges 1110.

FIG. 12 is an example of the storage layer and bridges for the data analytics accelerator. The storage layer 906 may include one or more types of storage deployed locally, remotely, or distributed within and/or external to one or more of the acceleration units 1100 and one or more of the data analytics accelerators 900. The storage layer 906 may include non-volatile memory (such as local data storage 1208) and volatile memory (such as an accelerator memory 1200) deployed local to the hardware layer 904. Non-limiting examples of the local data storage 1208 include, but are not limited to solid state drives (SSD) deployed local and internal to the data analytics accelerator 900. Non-limiting examples of the accelerator memory 1200 include, but are not limited to FPGA memory (for example, of the hardware layer 904 implementation of the acceleration unit 1100 using an FPGA), processing in memory (PIM) 1202 memory for example, banks 600 of memory 602 in a memory processing module 610, and SRAM, DRAM, and HBM (for example, deployed on a PCB with the acceleration unit 1100). The storage layer 906 may also use and/or distribute memory and data via the bridges 1110 (such as, for example, the memory bridge 1114) via a fabric 1306 (described below in reference to FIG. 13), for example, to other acceleration units 1100 and/or other acceleration processors 900. In some embodiments, storage elements may be implemented by one or more elements or sub-elements.

One or more bridges 1110 provide interfaces to and from the hardware layer 904. Each of the bridges 1110 may send and/or receive data directly or indirectly to/from elements of the acceleration unit 1100. Bridges 1110 may include storage 1112, memory 1114, fabric 1116, and compute 1118.

Bridges configuration may include the storage bridge 1112 interfaces with the local data storage 1208. The memory bridge interfaces with memory elements, for example the PIM 1202, SRAM 1204, and DRAM / HBM 1206. The fabric bridge 116 interfaces with the fabric 1306. The compute bridge 1118 may interface with the external data storage 920, the analytics engine 910 and the client data storage 911. For example, data in the client data storage 911, such as RAM on a client computer, implementing such as the analytics engine 910, may be transferred (pushed or pulled) via the bridges 1110, such as the compute bridge 1118 into the acceleration unit 1100 streaming data processing. For example, being transferred as input to the decompression module 1120 or the selector module 1102.

A data input bridge (not shown) may be configured to receive input from any of the other acceleration elements, including from other bridges, and to output to any of the acceleration unit elements, such as, for example, to the selector module 1102.

FIG. 13 is an example of networking for the data analytics accelerator. An interconnect 1300 may include an element deployed within each of the acceleration units 1100. The interconnect 1300 may be operationally connected to elements within the acceleration unit 1100, providing communications within the acceleration unit 1100 between elements. In FIG. 13, exemplary elements (1102, 1104, 1106, 1108, 1110) are shown connected to the interconnect 1300. The interconnect 1300 may be implemented using one or more sub-connection systems using one or more of a variety of networking connections and protocols between two or more of the elements, including, but not limited to, dedicated circuits and PCI switching. The interconnect 1300 may facilitate alternative and additional connections feed forward, and feedback between elements, including but not limited to looping, multi-pass processing, and bypassing one or more elements. The interconnect can be configured for communication of data, signaling, and other information.

Bridges 1110 may be deployed and configured to provide connectivity from the acceleration unit 1100-1 (from the interconnect 1300) to external layers and elements. For example, connectivity may be provided as described above via the memory bridge 1114 with the storage layer 906, via the fabric bridge 1116 with the fabric 1306, and via the compute bridge 1118 with the external data storage 920 and the analytics engine 910. Other bridges (not shown) may include NVME, PCIe, high-speed, low-speed, high-bandwidth, low-bandwidth, and so forth. The fabric 1306 may provide connectivity internal to the data analytics accelerator 900-1 and, for example, between layers like hardware 904 and storage 906, and between acceleration units, for example between a first acceleration unit 1100-1 to additional acceleration units 1100- N. The fabric 1306 may also provide external connectivity from the data analytics accelerator 900, for example between the first data analytics accelerator 900-1 to additional data analytics accelerators 900-N.

The data analytics accelerator 900 may use a columnar data structure. The columnar data structure can be provided as input and received as output from elements of the data analytics accelerator 900. In particular, elements of the acceleration units 1100 can be configured to receive input data in the columnar data structure format and generate output data in the columnar data structure format. For example, the selector module 1102 may generate output data in the columnar data structure format that is input by the FPE 1103. Similarly, the interconnect 1300 may receive and transfer columnar data between elements, and the fabric 1306 between acceleration units 1100 and accelerators 900.

Streaming processing includes a data flow path from data storage (such as the external data storage 920 or accelerator memory 1200) through the system, and being output (such as to the external data storage 920, accelerator memory 1200, or analytics engine 910). Streaming processing may exclude data returning to a previous processing element, that is, without recycling of processed data. In streaming processing, data may be processed once, and only once, by each element. Metadata, indicators, and the like may be fed forward or fed backward.

Streaming processing may avoid memory bounded operations which can limit communication bandwidth of memory mapped systems. Streaming processing may exclude the use of addressable memory. Buffering may be used for input data, during processing, and for output data of elements (modules, processors, engines). Buffering may use non-addressable memory. For example, data output from the decompression module 1120 may be buffered for input to the decoding module 1122. In a case where the acceleration unit 1100 is implemented using an FPGA, buffering may be implemented using FPGA memory. Alternatively and/or in addition, buffering may be implemented using the accelerator memory 1200.

The accelerator processing may include techniques such as columnar processing, that is, processing data while in columnar format to improve processing efficiency and reduce context switching as compared to row-based processing. The accelerator processing may also include techniques such as single instruction multiple data (SIMD) to apply the same processing on multiple data elements, increasing processing speed, facilitating "real-time” or “line-speed” processing of data. The fabric 1306 may facilitate large scale systems implementation.

Accelerator memory 1200, such as PIM 1202 and HBM 1206 may provide support for high bandwidth random access to memory. Partial processing may produce data output from the data analytics accelerator 900 that may be orders of magnitude less than the original data from storage 920. Thus, facilitating the completion of processing on analytics engine 910 or general- purpose compute with a significantly reduced data scale. Thus, computer performance is improved, for example, increasing processing speeds, decreasing latency, decreasing variation of latency, and reducing power consumption.

Consistent with the examples described in this disclosure, in an embodiment, a system includes a hardware based, programmable data analytics processor 900 configured to reside between a data storage unit 920 and one or more hosts (host processors) 910, wherein the programmable data analytics processor includes: a decompression module 1120 configured to input a first set of data, the first set of data being compressed data, restore the compressed data to the compressed data’s uncompressed data form, and output the uncompressed data; a decoding module 1122 configured to input the uncompressed data, apply one or more decoding functions to at least a portion of the uncompressed data to generate decoded data as a first set of data; a selector module 1102 configured to input the decoded data, the decoded data being based on the first set of data and, based on a selection indicator, output a first subset of the first set of data; a dictionary module 1124 configured to input the first subset of data, replace at least a portion of the first subset of data with corresponding data, and output lookup data; a fitter module 1126 configured to input the lookup data, change the size of at least a portion of the lookup data to desired sizes, and output fitted data; a filter and project module 1103 configured to input a second set of data and, based on a function, output an updated second set of data; a join and group module 1108 configured to combine data from one or more third data sets into a combined data set; and a communications fabric 1306 configured to transfer data between any of the decompression module, the decoding module, selector module, the dictionary module, the fitter module, the filter and project module, and the join and group module.

The modules may correspond to the modules discussed above in connection with, for example, FIGS. 8-13.

In some embodiments, the first set of data has a columnar structure. For example, the first set of data may include one or more data tables. In some embodiments, the second set of data has a columnar structure. For example, the second set of data may include one or more data tables. In some embodiments, the one or more third data sets have a columnar structure. For example, the one or more data sets may include one or more data tables.

In some embodiments, the programmable data analytics processor is configured to input data in parallel.

In some embodiments, the first set of data is input as a block of parallel data.

In some embodiments, the second set of data includes the first subset. In some embodiments, the one or more third data sets include the updated second set of data. In some embodiments, the first subset includes a number of values equal to or less than the number of values in the first set of data. In some embodiments, the one more third data sets include structured data. For example, the structured data may include table data in column and row format. In some embodiments, the one or more third data sets include one or more tables, and the combined data set includes at least one table based on combining columns from the one or more tables. In some embodiments, the one or more third data sets include one or more tables, and the combined data set includes at least one table based on combining rows from the one or more tables.

In some embodiments, the selection indicator is based on a previous filter value. In some embodiments, the selection indicator may specify a memory address associated with at least a portion of the first set of data. In some embodiments, the selector module is configured to input the decoded data, or data based on the first set of data as a block of data in parallel. The selector module 1102 may output the updated second set of data based on a function including using single instruction multiple data (SIMD) processing of the block of data to generate the first subset.

In some embodiments, the filter and project module 1103 includes at least one function configured to modify the second set of data. In some embodiments, the filter and projection module is configured to input the second set of data as a block of data in parallel and execute a SIMD processing function of the block of data to generate the second set of data.

In some embodiments, the join and group module 1108 is configured to combine columns from one or more tables. In some embodiments, the join and group module is configured to combine rows from one or more tables. In some embodiments, the modules are configured for line rate processing.

In some embodiments, the communications fabric 1306 is configured to transfer data by streaming the data between modules. Streaming (or stream processing or distributed stream processing) of data may facilitate parallel processing of data transferred to/from any of the modules discussed herein.

In some embodiments, the programmable data analytics processor 900 is configured to perform at least one of SIMD processing, context switching, and streaming processing. Context switching may include switching from one thread to another thread and may include storing the context of the current thread and restoring the context of another thread.

In an embodiment, a hardware based, programmable data analytics processor 900 is configured to reside between a data storage unit 920 and one or more host processors 910, wherein the programmable data analytics processor includes: a selector module 1102 configured to input data based on a first set of data and, based on a selection indicator output a first subset of the first set of data, a filter and project module 1103 configured to input a second set of data and, based on a function, output an updated second set of data, and a join and group module 1108 configured to combine data from one or more third data sets into a combined data set.

The programmable data analytics processor further includes one or more modules selected from the group consisting of: a decompression module 1120 configured to input the first set of data, the first set of data being compressed data, restore the compressed data to the compressed data’s uncompressed data form, and output the uncompressed data, a decoding module 1122 configured to input the uncompressed data, apply one or more decoding functions to at least a portion of the uncompressed data to generate decoded data, and a dictionary module 1124 configured to input the first subset of data, replace at least a portion of the first subset of data with corresponding data, and output lookup data, and a fitter module 1126 configured to input the lookup data, change the size of at least a portion of the lookup data to desired sizes, and output fitted data.

The programmable data analytics processor further includes a communications fabric 1306 configured to transfer data between any of the modules.

Note that a variety of implementations for modules and processing are possible, depending on the application. Modules are preferably implemented in software, but can also be implemented in hardware and firmware, on a single processor or distributed processors, at one or more locations. The above-described module functions can be combined and implemented as fewer modules or separated into sub-functions and implemented as a larger number of modules. Based on the above description, one skilled in the art will be able to design an implementation for a specific application.

Note that the above-described examples, numbers used, and exemplary calculations are to assist in the description of this embodiment. Inadvertent typographical errors, mathematical errors, and/or the use of simplified calculations do not detract from the utility and basic advantages of the disclosed embodiments.

To the extent that the appended claims have been drafted without multiple dependencies, this has been done only to accommodate formal requirements in jurisdictions that do not allow such multiple dependencies. Note that all possible combinations of features that would be implied by rendering the claims multiply dependent are explicitly envisaged and should be considered part of the disclosed embodiments. The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

As used herein, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise.

The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.

It is appreciated that certain features of the disclosed embodiments, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the disclosed embodiments, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub combination or as suitable in any other described embodiment. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Claims

1. A system, comprising: a hardware based, programmable data analytics processor configured to reside between a data storage unit and one or more host processors, wherein the programmable data analytics processor includes: a decompression module configured to input a first set of data, the first set of data being compressed data, restore the compressed data to the compressed data’s uncompressed data form, and output the uncompressed data; a decoding module configured to input the uncompressed data, apply one or more decoding functions to at least a portion of the uncompressed data to generate decoded data as a first set of data; a selector module configured to input the decoded data, the decoded data being based on the first set of data and, based on a selection indicator output a first subset of the first set of data; a dictionary module configured to input the first subset of data, replace at least a portion of the first subset of data with corresponding data, and output lookup data; a fitter module configured to input the lookup data, change the size of at least a portion of the lookup data to desired sizes, and output fitted data; a filter and project module configured to input a second set of data and, based on a function, output an updated second set of data; a join and group module configured to combine data from one or more third data sets into a combined data set; and a communications fabric configured to transfer data between any of the decompression module, the decoding module, selector module, the dictionary module, the fitter module, the filter and project module, and the join and group module.

2. The system of claim 1, wherein the second set of data includes the first subset.

3. The system of claim 1, wherein the one or more third data sets include the updated second set of data.

4. The system of claim 1, wherein the selection indicator is based on a previous filter value.

5. The system of claim 1, wherein the selection indicator specifies a memory address associated with at least a portion of the first set of data.

6. The system of claim 1, wherein the first subset includes a number of values equal to or less than the number of values in the first set of data.

7. The system of claim 1, wherein the one more third data sets include structured data.

8. The system of claim 7, wherein the structured data includes table data in column and row format.

9. The system of claim 7, wherein the one or more third data sets include one or more tables and the combined data set includes at least one table based on combining columns from the one or more tables.

10. The system of claim 1, wherein the one or more third data sets include one or more tables, and the combined data set includes at least one table based on combining rows from the one or more tables.

11. The system of claim 1, wherein the filter and project module comprises at least one function configured to modify the second set of data.

12. The system of claim 1, wherein the join and group module is configured to combine columns from one or more tables.

13. The system of claim 1, wherein the join and group module is configured to combine rows from one or more tables.

14. The system of claim 1, wherein the first set of data has a columnar structure.

15. The system of claim 1, wherein the second set of data has a columnar structure.

16. The system of claim 1, wherein the one or more third data sets have a columnar structure.

17. The system of claim 1, wherein the programmable data analytics processor is configured to perform single instruction multiple data (SIMD) processing.

18. The system of claim 1, wherein the programmable data analytics processor is configured to input data in parallel.

19. The system of claim 1, wherein the first set of data is input as a block of parallel data.

20. The system of claim 1 , wherein the selector module is configured to input the decoded data as a block of data in parallel and use single instruction multiple data (SIMD) processing of the block of data to generate the first subset.

21. The system of claim 1, wherein the filter and projection module is configured to input the second set of data as a block of data in parallel and wherein outputting the updated second set of data based on the function comprises executing a single instruction multiple data (SIMD) processing function of the block of data to generate the second set of data.

22. The system of claim 1, wherein the programmable data analytics processor is configured to perform streaming processing.

23. The system of claim 1, wherein the communications fabric is configured to transfer data by streaming the data between modules.

24. The system of claim 1 , wherein the modules are configured for line rate processing.

25. The system of claim 1, wherein the programmable data analytics processor is configured for context switching

26. A system, comprising: a hardware based, programmable data analytics processor configured to reside between a data storage unit and one or more host processors, wherein the programmable data analytics processor includes: a selector module configured to input data based on a first set of data and, based on a selection indicator output a first subset of the first set of data, a filter and project module configured to input a second set of data and, based on a function, output an updated second set of data, and a join and group module configured to combine data from one or more third data sets into a combined data set, and one or more modules selected from the group consisting of: a decompression module configured to input the first set of data, the first set of data being compressed data, restore the compressed data to the compressed data’ s uncompressed data form, and output the uncompressed data, a decoding module configured to input the uncompressed data, apply one or more decoding functions to at least a portion of the uncompressed data to generate decoded data, and a dictionary module configured to input the first subset of data, replace at least a portion of the first subset of data with corresponding data, and output lookup data, and a fitter module configured to input the lookup data, change the size of at least a portion of the lookup data to desired sizes, and output fitted data, and a communications fabric configured to transfer data between any of the modules.