Amazon Neptune

Neptune is a purpose-built, high-performance graph database engine that is optimized for storing billions of relationships and querying the graph with millisecond latency. Neptune supports the popular graph query languages Apache TinkerPop Gremlin and W3C’s SPARQL, which enable customers to build queries that efficiently navigate highly connected datasets. Some of the graph use cases Neptune powers are recommendation engines, fraud detection, knowledge graphs, drug discovery, and network security.

  • Neptune is highly available, with read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across Availability Zones. 
  • Neptune provides data security features, with support for encryption at rest and in transit. Neptune is fully managed, which means that hardware provisioning, software patching, setup, configuration, and backups are handled by the service.

Neptune Features

Customers can launch a database instance and connect their application within minutes without additional configuration. Database Parameter Groups provide granular control and fine-tuning of the database.

  • Neptune provides Amazon CloudWatch metrics for database instances, so customers can use the AWS Management Console to view more than 20 key operational metrics for their database instances, including compute, memory, storage, query throughput, and active connections.
  • Customers can control if and when their instance is patched via Database Engine Version Management. Neptune can notify customers via email or SMS of important database events like automated failover.
  • Amazon Neptune supports quick, efficient cloning operations, where entire multi-terabyte database clusters can be cloned in minutes. Cloning is useful for a number of purposes including application development, testing, database updates, and running analytical queries.

Amazon Neptune supports the Property Graph model using the open-source Apache TinkerPop Gremlin traversal language and provides a Gremlin WebSockets server that supports TinkerPop version 3.3.

  • Using Neptune, AWS customers can quickly build fast Gremlin traversals over property graphs. Existing Gremlin applications can easily use Neptune by changing the Gremlin service configuration to point to a Neptune instance.
  • Neptune supports W3C’s Resource Description Framework (RDF) and SPARQL. RDF is popular because it provides flexibility for modeling complex information domains.

Neptune uses graph structures such as nodes (data entities), edges (relationships), and properties to represent and store data. Relationships are stored as first-class citizens of the data model. This allows data in nodes to be directly linked, dramatically improving the performance of queries that navigate relationships in the data. Neptune’s interactive performance at scale effectively enables a broad set of graph use cases.

  • A graph in a graph database can be traversed along specific edge types, or across the entire graph.
  • Graph databases can represent how entities relate by using actions, ownership, parentage, and so on.
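As a minimal illustration of traversing a graph along a specific edge type, the following Python sketch walks a tiny adjacency-list graph. The node and edge names are invented for illustration; in Neptune the same traversal would be expressed in Gremlin or SPARQL.

```python
# A tiny hypothetical property graph: node -> list of (edge_label, target).
graph = {
    "alice": [("FOLLOWS", "bob"), ("PURCHASED", "tent")],
    "bob": [("FOLLOWS", "carol"), ("PURCHASED", "tent")],
    "carol": [("PURCHASED", "kayak")],
    "tent": [],
    "kayak": [],
}

def traverse(start, edge_label):
    """Return all nodes reachable from `start` along edges with `edge_label`."""
    seen, frontier, result = {start}, [start], []
    while frontier:
        node = frontier.pop()
        for label, target in graph.get(node, []):
            if label == edge_label and target not in seen:
                seen.add(target)
                result.append(target)
                frontier.append(target)
    return result

print(traverse("alice", "FOLLOWS"))  # → ['bob', 'carol']
```

The same shape of query (filter edges by label, follow them transitively) underlies the recommendation and fraud-detection examples later in this section.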

Amazon Neptune helps you build applications that store and navigate information in the life sciences, and process sensitive data easily using encryption at rest. For example, customers can use Neptune to store models of disease and gene interactions, and search for graph patterns within protein pathways to find other genes that might be associated with a disease.

  • Neptune helps integrate information to tackle challenges in healthcare and life sciences research. With Neptune, creating and storing patient relationships from medical records across different systems is seamless. It also makes it possible to organize research publications topically so that relevant information can be found quickly.

With Amazon Neptune, clients can store relationships between information categories such as customer interests, friends, and purchase history in a graph. They can then quickly query it to make recommendations that are personalized and relevant.

  • Using a highly available graph database, customers can make product recommendations to a user based on which products are purchased by others who follow the same sport and have a similar purchase history, or identify people who have a friend in common but don’t yet know each other and make a friendship recommendation.

Customers can scale the compute and memory resources powering the production cluster up or down by creating new replica instances of the desired size, or by removing instances. Compute scaling operations typically complete in a few minutes.

  • Amazon Neptune automatically grows the size of the database volume as database storage needs grow. The volume can grow in increments of 10 GB up to a maximum of 64 TB.
  • Amazon Neptune replicas increase read throughput to support high-volume application requests; customers can create up to 15 database read replicas. Because replica nodes do not perform writes, more processing power is free to serve read requests, which reduces replica lag time, often to single-digit milliseconds.

Amazon Neptune allows fast, parallel bulk loading for Property Graph data that is stored in S3. Customers can use a REST interface to specify the S3 location for the data, which is loaded from delimited CSV files describing nodes and edges.

  • RDF Bulk Loading: Amazon Neptune enables fast, parallel bulk loading for RDF data that is stored in S3. Customers can use a REST interface to specify the S3 location for the data.
    • The N-Triples (NT), N-Quads (NQ), RDF/XML, and Turtle RDF 1.1 serializations are supported.
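For Property Graph data, the bulk-load CSV format uses system columns such as ~id, ~label, ~from, and ~to, with typed property columns like name:String. The following Python sketch builds minimal vertex and edge files in that shape; the sample data is hypothetical.

```python
import csv
import io

def make_csv(header, rows):
    """Render a header plus data rows as CSV text (as it would be staged in S3)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Vertex file: one row per vertex, with typed property columns.
vertices = make_csv(
    ["~id", "~label", "name:String"],
    [["v1", "person", "alice"], ["v2", "person", "bob"]],
)

# Edge file: one row per edge, linking ~from and ~to vertex IDs.
edges = make_csv(
    ["~id", "~from", "~to", "~label"],
    [["e1", "v1", "v2", "follows"]],
)

print(vertices)
print(edges)
```

In practice these files are uploaded to S3 and referenced by the loader's REST endpoint; the sketch only illustrates the file layout.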

Resource Description Framework (RDF) provides flexibility for modeling complex information domains. There are a number of existing free or public datasets available in RDF including Wikidata and PubChem, a database of chemical molecules. 

  • Amazon Neptune supports the W3C’s Semantic Web standards RDF and SPARQL, and it also provides an HTTP REST endpoint that implements the SPARQL Protocol.

Graph databases are useful for connected, contextual, relationship-driven data, such as social media data, recommendation engines, driving directions (route finding), logistics, diagnostics, and scientific data analysis in fields like neuroscience.

  • Another use case for graph databases is detecting fraud. For example, you can track credit card purchases and purchase locations to detect uncharacteristic use. Detecting fraudulent accounts is another example.

Amazon Neptune enables customers to build knowledge graph applications. A knowledge graph lets them store information in a graph model and use graph queries to help their users navigate highly connected datasets more easily. Neptune supports open-source and open-standard APIs so that customers can quickly use existing information resources to build knowledge graphs and host them on a fully managed service.

  • For example, suppose that a user is interested in the Mona Lisa by Leonardo da Vinci. The user can discover other works of art by the same artist or other works located in the Louvre. Using a knowledge graph, it is possible to add topical information to product catalogs, build and query complex models of regulatory rules, or model general information, like Wikidata.

With Amazon Neptune, you can use relationships to process financial and purchase transactions in near-real time to easily detect fraud patterns. Neptune provides a fully managed service to execute fast graph queries to detect that a potential purchaser is using the same email address and credit card as a known fraud case.

  • If you are building a retail fraud detection application, Neptune can help you build graph queries. These queries can help you easily detect relationship patterns, such as multiple people associated with a personal email address or multiple people who share the same IP address but reside in different physical addresses.

Neptune Components

The type of instance that customers specify determines the hardware of the host computer used for their instance. Each instance type offers different compute, memory, and storage capabilities, and instance types are grouped into instance families based on these capabilities. Each instance type provides a higher or lower minimum performance from a shared resource.

Neptune replica

A Neptune replica connects to the same storage volume as the primary DB instance and supports only read operations. Each Neptune DB cluster can have up to 15 Neptune replicas in addition to the primary DB instance. This provides high availability by locating Neptune replicas in separate Availability Zones and distributing load from reading clients.

Primary DB

The primary DB instance supports read and write operations, and performs all of the data modifications to the cluster volume. Each Neptune DB cluster has one primary DB instance, which is responsible for writing (that is, loading or modifying) graph database contents.

Cluster volume

Neptune data is stored in the cluster volume, which is designed for reliability and high availability. A cluster volume consists of copies of the data across multiple Availability Zones in a single AWS Region. Because your data is automatically replicated across Availability Zones, it is highly durable, and there is little possibility of data loss.

Gremlin

The Gremlin Console is a fairly standard REPL (Read-Eval-Print Loop) shell. It is based on the Groovy console, and if you have used other console environments, such as those found with Scala, Python, and Ruby, you will feel right at home. The console offers a low-overhead (you can set it up in seconds), low-barrier-of-entry way to start experimenting with graphs, and it can work with graphs that are running locally or remotely. In Neptune, a Gremlin edge statement asserts the existence of an edge between two vertices in a graph. The subject (S) of an edge statement is the source vertex. The predicate (P) is a user-supplied edge label. The object (O) is the target vertex. The graph (G) is a user-supplied edge identifier.

  • A Gremlin property statement in Neptune asserts an individual property value for a vertex or edge. The subject is a user-supplied vertex or edge identifier.
  • The predicate is the property name (key), and the object is the individual property value.
  • The graph (G) is again the default graph identifier, the null graph, displayed as <~>.
  • A property can be represented by storing the element identifier in the S position, the property key in the P position, and the property value in the O position.

Property graph data in Amazon Neptune is composed of four-position (quad) statements. Each of these statements represents an individual atomic unit of property graph data.  Each quad is a statement that makes an assertion about one or more resources. A statement can assert the existence of a relationship between two resources, or it can attach a property (key-value pair) to a resource. One can think of the quad predicate value generally as the verb of the statement. It describes the type of relationship or property that’s being defined. The object is the target of the relationship, or the value of the property.  

  • User-facing values in a quad statement are usually stored separately in a dictionary index, where the statement indexes reference them using an 8-byte long term identifier.
  • The exception to this is numeric values, including date and datetime values (represented as milliseconds from the epoch). These can be stored inline directly in the statement indexes.
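The quad layout and dictionary index described above can be sketched as follows. This is an illustrative model, not Neptune's actual storage code: the encoding details (small integer IDs, the `<~>` default-graph marker) are simplified stand-ins for what the text describes.

```python
from collections import namedtuple

# A four-position (S, P, O, G) statement.
Quad = namedtuple("Quad", ["s", "p", "o", "g"])

class Dictionary:
    """Maps user-facing values to compact integer term identifiers."""
    def __init__(self):
        self.to_id, self.to_value = {}, {}

    def intern(self, value):
        if value not in self.to_id:
            term_id = len(self.to_id) + 1
            self.to_id[value] = term_id
            self.to_value[term_id] = value
        return self.to_id[value]

d = Dictionary()

def encode(s, p, o, g="<~>"):
    # Numeric values stay inline in the statement; strings are replaced
    # by dictionary term identifiers.
    enc = lambda v: v if isinstance(v, (int, float)) else d.intern(v)
    return Quad(enc(s), enc(p), enc(o), enc(g))

# An edge statement and a property statement about the same vertex.
edge = encode("v1", "follows", "v2", "e1")   # G is the edge identifier
prop = encode("v1", "age", 29)               # G defaults to the null graph
print(edge, prop)
```

Note how the property value 29 is stored inline while every string goes through the dictionary, mirroring the distinction drawn in the bullets above.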

Amazon Neptune features a Gremlin explain tool, a self-service tool for understanding the execution approach taken by the Neptune engine. Customers invoke it by adding an explain parameter to an HTTP call that submits a Gremlin query. The explain feature provides information about the logical structure of query execution plans, and it can be used to identify potential evaluation and execution bottlenecks.

Graph Database

Amazon Neptune is a graph database, purpose-built to store and navigate relationships. Graph databases have advantages over relational databases for certain use cases, including social networking, recommendation engines, and fraud detection, where you need to create relationships between data and quickly query those relationships. Building these types of applications on a relational database poses a number of challenges: it requires multiple tables with multiple foreign keys, and the SQL queries needed to navigate this data require nested queries and complex joins that quickly become unwieldy. Neptune instead uses graph structures such as nodes (data entities), edges (relationships), and properties to represent and store data. Relationships are stored as first-class citizens of the data model.

  • This allows data in nodes to be directly linked, which dramatically improves the performance of queries that navigate relationships in the data. Neptune’s interactive performance at scale effectively enables a broad set of graph use cases.

 

Graph databases can represent how entities relate by using actions, ownership, parentage, and so on. Whenever connections or relationships between entities are at the core of the data a graph database is a natural choice. 

  • Graph databases are useful for modeling and querying social networks, business relationships, dependencies, shipping movements, and similar items.

Graph databases are useful for connected, contextual, relationship-driven data. Other use cases include recommendation engines, driving directions (route finding), logistics, diagnostics, and scientific data analysis in fields like neuroscience.

Recommendation Engines

Amazon Neptune enables customers to store relationships between information categories such as customer interests, friends, and purchase history in a graph. They can then quickly query it to make recommendations that are personalized and relevant. 

  • Customers can use a highly available graph database to make product recommendations to a user based on which products are purchased by others who follow the same sport and have a similar purchase history, or identify people who have a friend in common but don’t yet know each other and make a friendship recommendation.

Life Sciences

Amazon Neptune enables customers to build applications that store and navigate information in the life sciences, and process sensitive data easily using encryption at rest. 

  • Using Neptune, they can store models of disease and gene interactions. It can be used to search graph patterns within protein pathways to find other genes that might be associated with a disease. 
  • Neptune helps integrate information to tackle challenges in healthcare and life sciences research. It can be used to create and store patient relationships from medical records across different systems.

Fraud Detection

Using Amazon Neptune, customers can use relationships to process financial and purchase transactions in near-real time to easily detect fraud patterns. Neptune provides a fully managed service to execute fast graph queries to detect that a potential purchaser is using the same email address and credit card as a known fraud case. 

  • With Neptune, customers can build graph queries. These queries can help them detect relationship patterns, such as multiple people associated with a personal email address or multiple people who share the same IP address but reside in different physical addresses.

Knowledge Graphs

Amazon Neptune allows customers to build knowledge graph applications. A knowledge graph lets them store information in a graph model and use graph queries to help their users navigate highly connected datasets more easily.

  • Neptune supports open-source and open-standard APIs so that customers can quickly use existing information resources to build knowledge graphs and host them on a fully managed service.
  • Using a knowledge graph, customers can add topical information to product catalogs, build and query complex models of regulatory rules, or model general information, like Wikidata.

Amazon Quantum Ledger Database (QLDB)

Amazon Quantum Ledger Database (QLDB) provides a transparent, immutable, and cryptographically verifiable transaction log ‎owned by a central trusted authority. Amazon QLDB tracks each and every application data change and maintains a complete and verifiable history of changes over time. Ledgers are typically used to record a history of economic and financial activity in a private or public organization. Using ledger-like functionality gives organizations an accurate history of their applications’ data.

  • Amazon QLDB is a new class of database that eliminates the need to engage in the complex development effort of building your own ledger-like applications.
  • With QLDB, your data’s change history is immutable; it cannot be altered or deleted.
  • Using cryptography, customers can easily verify that there have been no unintended modifications to their application’s data. QLDB uses an immutable transactional log, known as a journal, that tracks each application data change and maintains a complete and verifiable history of changes over time.

QLDB Features

Using cryptography, customers can easily verify that there have been no unintended modifications to their application’s data. The change history of customer data in QLDB is immutable: it cannot be altered or deleted. QLDB also lets customers request a secure summary of that change history.

  • This secure summary, commonly known as a digest, is generated using a cryptographic hash function.
  • The digest acts as a proof of customers data’s change history, allowing them to look back and verify the integrity of their data changes.

QLDB is serverless, so it automatically scales to support the demands of client applications. In QLDB there are no servers to manage and no read or write limits to configure.

  • Since QLDB is a database, it provides better performance and scale than blockchain frameworks. QLDB can easily scale up and execute 2-3x as many transactions as common blockchain frameworks. 
  • Blockchain frameworks are decentralized and require peer nodes to validate a transaction before it can be stored in the ledger, which impacts their performance. By contrast, executing a transaction in QLDB is as simple as in any other AWS database.

Amazon QLDB’s familiar database capabilities make it easy to use. QLDB supports PartiQL – a new, open source, SQL-compatible query language designed to easily work with all data types and structures. With PartiQL, you can easily query, manage, and update your data with SQL operators.

  • QLDB’s document-oriented data model is flexible, enabling you to easily store and process both structured and semi-structured data.
  • QLDB transactions are ACID compliant and have full serializability, the highest level of isolation.

QLDB uses an immutable transactional log, known as a journal, that tracks each application data change and maintains a complete and verifiable history of changes over time.

  • The journal is append-only, which means data can only be added to a journal; it cannot be overwritten or deleted. This ensures that customers’ stored change history cannot be deleted or modified.
  • Amazon QLDB allows customers to access the entire change history of their application’s data.
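The append-only, hash-chained journal design can be sketched in a few lines of Python. This is a toy model, assuming each block's hash covers its content plus the previous block's hash; QLDB's real journal format is more involved, but the sketch shows why earlier history cannot be silently rewritten.

```python
import hashlib
import json

def block_hash(prev_hash, content):
    """Hash a block's content chained to the previous block's hash."""
    payload = prev_hash + json.dumps(content, sort_keys=True).encode()
    return hashlib.sha256(payload).digest()

class Journal:
    def __init__(self):
        self.blocks = []  # (content, hash) pairs; append-only

    def append(self, content):
        prev = self.blocks[-1][1] if self.blocks else b"\x00" * 32
        self.blocks.append((content, block_hash(prev, content)))

    def verify(self):
        """Recompute the full hash chain and check it against stored hashes."""
        prev = b"\x00" * 32
        for content, h in self.blocks:
            if block_hash(prev, content) != h:
                return False
            prev = h
        return True

j = Journal()
j.append({"op": "INSERT", "doc": "d1", "version": 0})
j.append({"op": "UPDATE", "doc": "d1", "version": 1})
assert j.verify()

# Tampering with an earlier block breaks verification of the whole chain.
j.blocks[0] = ({"op": "INSERT", "doc": "d1", "version": 99}, j.blocks[0][1])
assert not j.verify()
```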

Amazon QLDB supports PartiQL, which provides SQL-compatible access to QLDB’s document-oriented data model, including semi-structured and nested data, while remaining independent of any particular data source.

  • PartiQL helps customers to easily query, manage, and update their data using familiar SQL operators.
  • Amazon QLDB provides atomicity, consistency, isolation, and durability, known as the ACID properties. In addition, QLDB transactions have full serializability, in other words the highest level of isolation.
  • Amazon QLDB stores data using a document-oriented data model, which provides the flexibility to store structured and semi-structured data. QLDB’s data model also supports nested data structures, which can simplify any application.

Amazon QLDB is designed for high availability, replicating multiple copies of data within an Availability Zone (AZ) as well as across 3 AZs in an AWS region, without any additional cost or setup.

  • QLDB backs up your data continuously while maintaining consistent performance, allowing it to transparently recover from any instance or physical storage failures.

Cryptography

Using both the SHA-256 hash function and a Merkle tree–based model, QLDB generates a cryptographic representation known as a digest. The digest acts as a unique signature of the entire change history of a customer’s data as of a point in time. It enables customers to look back and verify the integrity of their document revisions relative to that signature.

  • A digest is a cryptographic representation of your ledger’s entire journal at a point in time. A journal is append-only, and journal blocks are sequenced and hash-chained similar to blockchains.
  • A Merkle tree is a tree data structure in which each leaf node represents a hash of a data block. Each non-leaf node is a hash of its child nodes. Commonly used in blockchains, a Merkle tree enables efficient verification of large datasets with an audit proof mechanism.
  • A proof is the ordered list of node hashes that QLDB returns for a given digest and document revision. It consists of the hashes that are required by a Merkle tree model to chain the given leaf node hash (a revision) to the root hash (the digest).
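The Merkle tree and proof mechanics described above can be sketched as follows, assuming SHA-256 leaf hashes and pairwise-hashed internal nodes. This is a simplified stand-in for QLDB's actual digest and proof format, meant only to show how a short list of sibling hashes chains one revision up to the digest.

```python
import hashlib

H = lambda b: hashlib.sha256(b).digest()

def merkle_root(leaves):
    """Hash leaves pairwise, level by level, up to a single root (the digest)."""
    level = [H(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:             # duplicate the last node on odd levels
            level.append(level[-1])
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Ordered sibling hashes needed to chain leaves[index] up to the root."""
    level = [H(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index + 1 if index % 2 == 0 else index - 1
        proof.append((level[sibling], index % 2 == 0))
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Recompute the path from a leaf through its siblings; compare to root."""
    node = H(leaf)
    for sibling, leaf_is_left in proof:
        node = H(node + sibling) if leaf_is_left else H(sibling + node)
    return node == root

revisions = [b"rev0", b"rev1", b"rev2", b"rev3"]
digest = merkle_root(revisions)
proof = merkle_proof(revisions, 2)
assert verify(b"rev2", proof, digest)
assert not verify(b"tampered", proof, digest)
```

The proof stays logarithmic in the number of revisions, which is why large datasets can be verified efficiently.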

Amazon QLDB is designed to address the needs of high-performance online transaction processing (OLTP) workloads. QLDB has SQL-like query capabilities and delivers full ACID transactions. QLDB data items are documents, which deliver schema flexibility and intuitive data modeling. With a journal at the core, QLDB makes it easy to access the complete and verifiable history of all changes to any data, and to stream coherent transactions to other data services as needed.

  • QLDB implements concurrency control using optimistic concurrency control (OCC). OCC operates on the principle that multiple transactions can frequently complete without interfering with each other.
  • Before committing each transaction, OCC performs a validation check to ensure that no other committed transaction has modified the snapshot of data that it’s accessing. If this check reveals conflicting modifications, or the state of the data snapshot changes, the committing transaction is rejected.
  • For data storage, QLDB uses an immutable transactional log known as a journal. This journal tracks every change to the data and maintains a complete and verifiable history of changes over time.
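The OCC validation step can be illustrated with a toy in-memory store. The version-tracking scheme here is invented for illustration (QLDB's implementation is internal); it only shows the principle of validating at commit time rather than locking at read time.

```python
class Store:
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, read_set, writes):
        """Validate the read snapshot, then apply writes atomically."""
        # Validation: every version we read must still be current.
        for key, version in read_set.items():
            if self.read(key)[1] != version:
                return False  # conflict detected: reject the transaction
        for key, value in writes.items():
            self.data[key] = (value, self.read(key)[1] + 1)
        return True

store = Store()
store.data["balance"] = (100, 1)

# Two transactions read the same snapshot...
snapshot = {"balance": store.read("balance")[1]}

# ...the first commits successfully and bumps the version...
assert store.commit(dict(snapshot), {"balance": 80})

# ...so the second, still holding the stale snapshot, is rejected.
assert not store.commit(dict(snapshot), {"balance": 90})
```

A rejected transaction is simply retried against the new state, which is why no read or write locks are needed.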

Ledger Structure

As a ledger database, QLDB differs from other document-based databases when it comes to the following key concepts. This section provides an overview of the core concepts and terminology in Amazon QLDB, including ledger structure and how a ledger manages data. 

Write Transactions

When an application needs to modify data in a document, it does so in a database transaction. Within a transaction, data is read from the ledger, updated, and committed to the journal. The journal represents a complete and immutable history of all the changes to your data.

  • QLDB writes one or more chained blocks to the journal in a transaction. Each block contains entry objects that represent the document revisions that you insert, update, and delete, along with the PartiQL statements that committed them.
  • Transactions are committed to the journal as blocks that contain document revision entries. Each block is hashed and chained to subsequent blocks for verification, and each block has a sequence number that specifies its address within the strand.
    • A strand is a partition of your ledger’s journal. QLDB currently supports journals with a single strand only.

Data Storage

There are two types of data storage in QLDB: 

  • Journal storage—The disk space that is used by a ledger’s journal. The journal is append-only and contains the complete, immutable, and verifiable history of all the changes to your data.
  • Indexed storage—The disk space that is used by a ledger’s tables, indexes, and indexed history. Indexed storage consists of ledger data that is optimized for high-performance queries.

After your data is committed to the journal, it is materialized into the tables that you define. These tables enable faster and more efficient queries. When an application reads data, it accesses the tables and indexes that are stored in your indexed storage.

Ledger Structure

Fundamentally, QLDB data is organized into tables of Amazon Ion documents. More precisely, tables are collections of document revisions. A document revision represents a single iteration of the document’s full dataset. Because QLDB stores the complete change history of the data, a table contains not only the latest revision of its documents, but also all prior iterations. 

  • Document revisions represent the inserting, updating, and deleting of documents in a collection.
  • The history function in QLDB is a PartiQL extension that returns revisions from the system-defined view of your table, so it includes both your data and the associated metadata in the same schema as the committed view.
  • The QLDB history function can also be queried with a table ID as the first input parameter, which makes it possible to query the history of a table even after it has been dropped.

Querying Data

QLDB is intended to address the needs of high-performance online transaction processing (OLTP) workloads. A ledger provides queryable views of customer data based on the transaction information that is committed to the journal. Similar to views in relational databases, a view in QLDB is a projection of the data in a table. Views are maintained in real time, so they’re always available for applications to query. Customers can query the following views using PartiQL SELECT statements:

  • User—The latest non-deleted revision of the application-defined data only. This is the default view in QLDB.
  • Committed—The latest non-deleted revision of both the data and the system-generated metadata. This is the full system-defined table that corresponds directly to the user table.
  • Customers can also query the revision history of their data by using the built-in history function. The history function returns both the data and the associated metadata in the same schema as the committed view.

Key Terms

Indexed storage The disk space that is used by a ledger’s tables, indexes, and indexed history. Indexed storage consists of ledger data that is optimized for high-performance queries.

Entry An object that is contained in a block. Entries represent document revisions that are inserted, updated, and deleted in a transaction, along with the PartiQL statements that committed them.

  • Each entry also has a hash value for verification. An entry hash consists of the full hash chain of every revision and statement within that entry combined with the hash of the previous chained entry.

Journal The hash-chained set of all blocks that are committed in your ledger. The journal is append-only and represents a complete and immutable history of all the changes to your ledger data.

Journal storage The disk space that is used by a ledger’s journal.

Journal strand A partition of a journal. QLDB currently supports journals with a single strand only.

Proof The ordered list of 256-bit hash values that QLDB returns for a given digest and document revision. It consists of the hashes that are required by a Merkle tree model to chain the given revision hash to the digest hash.

  • A proof enables you to verify the integrity of your revisions relative to the digest. For more information, see Data Verification in Amazon QLDB.

Table An unordered collection of document revisions.

View A queryable projection of the data in a table, based on transactions committed to the journal. In a PartiQL statement, a view is denoted with a prefix qualifier (starting with _ql_) for a table name.

Block An object that is committed to the journal in a transaction. A single transaction writes one or more blocks in the journal, but a block can only be associated with one transaction. A block contains entries that represent the document revisions that were committed in the transaction along with the PartiQL statements that committed them.

  • Each block also has a hash value for verification. A block hash consists of the full hash chain of every entry within that block combined with the hash of the previous chained block.

Digest A 256-bit hash value that uniquely represents your ledger’s entire history of document revisions as of a point in time. A digest hash is generated from your ledger’s full hash chain as of the latest committed block in the journal at that time.

  • QLDB enables you to generate a digest as a secure output file. Then, you can use that output file to verify the integrity of your document revisions relative to that hash.

Document A set of data in Amazon Ion struct format that can be inserted, updated, and deleted in a table. A QLDB document can have structured, semistructured, nested, and schema-less data.

document revision A structure that represents a single iteration of a document’s full dataset. A revision includes both your application-defined data and QLDB-generated metadata.

  • Each revision is stored in a table and is uniquely identified by a combination of the document ID and a zero-based version number.
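As a small illustration, a revision's identity as a (document ID, zero-based version) pair might be modeled like this. The data and helper are hypothetical, not the QLDB API.

```python
# (doc_id, version) -> revision data; a toy stand-in for a table's history.
history = {}

def write_revision(doc_id, data):
    """Store a new revision and return its zero-based version number."""
    version = sum(1 for (d, _) in history if d == doc_id)  # next version
    history[(doc_id, version)] = data
    return version

assert write_revision("doc-1", {"owner": "alice"}) == 0
assert write_revision("doc-1", {"owner": "bob"}) == 1
# Earlier revisions remain addressable alongside the latest one.
assert history[("doc-1", 0)] == {"owner": "alice"}
```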

 

QLDB Use Cases

Large organizations such as Accenture, Digital Asset, and Healthdirect use Amazon QLDB for purposes such as the following:

  • Banks can use QLDB to easily store an accurate and complete record of all financial transactions.
  • A ledger database can be used to record the history of each transaction, and provide details of every individual batch of the product manufactured at a facility. In case of a product recall, manufacturers can use QLDB to easily trace the history of the entire production and distribution lifecycle of a product.
  • Insurance companies can use QLDB to accurately maintain the history of claims over their entire lifetime. Whenever a potential conflict arises, QLDB can also help cryptographically verify the integrity of the claims data, making the application resilient against data entry errors and manipulation.
  • By implementing a system-of-record application using QLDB, customers can easily maintain a trusted and complete record of the digital history of their employees, in a single place.
  • With QLDB, retail companies can look back and track the full history of inventory and supply chain transactions at every logistical stage of their products.