Don't see your question here? Get in touch with us!
The Private Data Warehouse (PDW) is a clone of your Magento BI data warehouse exported to an Amazon Redshift instance. PDW allows you to interact with and explore your data any way you like, whether it's running queries or layering on statistical analysis tools.
Note that PDW is an additional feature we provide for Magento BI, not a requirement. You can still interact with your data via the Data Warehouse Manager. PDW is intended for users who want to directly query their data, build custom applications, or layer third-party tools on top of their warehouse.
PDW updates are directly tied to Magento BI updates, so every updated Magento BI batch will have a corresponding PDW update that happens around the same time.
All data from Magento BI integrations are stored within tables inside a schema in Redshift titled rjm_current. This schema represents the latest batch of data pulled from your data sources.
In addition to rjm_current, another schema that starts with rjm_pending and a timestamp might also be visible at times. You can ignore this schema, as it’s used for loading the next batch of data and is renamed to rjm_current after importing is complete.
As new batches finish importing into Redshift, the existing rjm_current schema is dropped and the rjm_pending schema is renamed to rjm_current. This allows queries to be saved and reused while at the same time allowing for new batches of data to appear.
Because the rjm_pending schema replaces the rjm_current schema during updates, any database views that point to tables inside of rjm_current will be lost when a new batch of data arrives. We recommend you save the definition of these views and rerun them before beginning any analysis work.
The column names for each table should match closely to the column names you see inside the Data Warehouse Manager, but there are a few situations where they may be slightly different. This is to accommodate naming restrictions within Redshift. Data typing should also be consistent with what’s shown in the Data Warehouse Manager.
Many SaaS integrations used nested data structures, which means each attribute (or column) in a table could have its own set of attributes. PDW is designed to deconstruct these nested structures into separate tables to easier to query the data. Note that Magento BI currently supports only two levels of nesting.
Let’s use Shopify as an example. Shopify orders data uses nested records, which are comprised of three parts - an order record, line items, and tax lines. In the example below, we’ll only use a few attributes in each part to explain how the deconstruction works. Here’s what a Shopify order record might look like:
{ “order_id” : “1234” { “created_at” : “2015-01-01 00:00:00”, “customer” : “100”, “line_items” : [ { “product_id” : “5008798”, “price” : “5.99”, “quantity” : “1” } ] } }
Order Record (top level)
This is the core data that comprises the order, including the order time, the customer, and so on. The primary key for orders - order_id - is also included. PDW will store this table as orders in Redshift.
order_id [pk] | created_at | customer |
1234 | 2015-01-01 00:00:00 | 100 |
1235 | 2015-01-01 00:05:10 | 376 |
Line_Items (second level)
This level contains contains one or more sub records that hold the various items ordered as part of the top level order. Attributes include product information, price, and so on.
PDW will store this table as orders__line_items in the data warehouse, and will append a few additional fields to help connect these rows back to the top level record:
_id_of_parent | _sub_id | product_id | price | quantity |
1234 | 0 | 5008798 | 5.99 | 1 |
1234 | 1 | 3445689 | 10.99 | 1 |
1235 | 0 | 3435543 | 6.99 | 2 |
If we wanted to return all line items for order number 1234, we’d run the following query:
SELECT * FROM orders__line_items li WHERE _id_of_parent = 1234
We recommend always joining the top level table to the nested table - this will allow you to avoid queries that may have outdated data.
Comments