#rapidace makes #agile #datawarehousing easier

Retweet

Hi folks, it’s been a long time since my last post, I am sorry for that.  I’ve been knee deep and heads down in implementation for customers!!  I’m pleased to announce that I am now delivering with astounding results.

Just to give you an idea:

The customer cleaned up a source data model, and identified the business keys – then placed foreign keys on the business keys in the model to be used.  The rough size of the source model is around 250 tables.  This process took them (as I understand it) about a week to do.

How long would this have taken the customer if they did it by hand?

IF the customer had done this by hand, my estimates show that it would have taken them roughly 6 months (at best) with a team of 3 full-time employees.

How long did it take you + RapidACE?

Overall, working time was 4 weeks for this.  However, I was adjusting things that are now set in place.  I would generally say this entire process should take no more than 2 weeks to turn around.  Tops.

RapidModel Produced (in 10 minutes)

  • Fully functioning cross-reference source to Data Vault
  • Hubs, Links, Satellites, Hierarchical Satellites

Internally, RapidModel also: folded foreign key relationships, handled field name changes, utilized a user-based ontology (same-as for field name recognition), scanned the source tables that were unconnected, and “connected” the proper table structures in to the model at the right points, used a highly advanced “magic” algorithm to detect similar field names which may have contained spelling mistakes and data type mismatches.

The end result? a fully optimized Data Vault model with Composite Hub Keys, driven by user-defined business keys, full satellite structures (both hanging on links and on hubs), Link structures – both required and optional links!! See my new post on http://danLinstedt.com for more info on this subject…

RapidGen Produced (in 10 minutes)

Fully functioning Informatica 9.1 ETL XML file, total object count: around 9,000 informatica objects

  • Shared Sources + Primary, Foreign Key + comments
  • Shared Targets + Primary, foreign key + comments
  • Shared Transformations
  • Mappings (using shortcutted objects) for Hubs, Links, Satellites, Hierarchical Links
  • For Satellites: mappings for Inserts, Updates, and End Date Handling
  • Sessions (reusable)
  • Worklets containing both Series driven (sequential) and Parallel executing sessions
  • Master workflow (for running all the sessions)

Total # of target tables: around 890
Total # of mappings: around 2500

Import to Informatica took just under 3 hours on my laptop (due to the huge file), but imported successfully with zero import errors.

What does this mean to me?

It means you can “turn around” your raw Data Vault data warehouse PLUS all the ETL loading code in about 1 day.  Talk about agile!

How can I access this?

You can access this by hiring me for a fixed bid contract.  I will run RapidACE for you, and guide you through the prep work, as well as ensure the deliverables meet your needs.

Note: you have full control over:

  • Table naming conventions
  • Informatica Widget naming conventions
  • Number of parallel executing processes put in a worklet
  • Using or not using shortcuts
  • Using or not using end-date maps
  • Using or not using Last-seen-dates
  • Changing the Cross-reference and the data model AFTER generation, but BEFORE ETL code generation, letting you customize your desired results!!

Can I customize my “mapping styles” for Hubs, Links, Satellites, etc..?

YES!  However, there is an extra charge for this service

What if I want SQL routines? Can you Generate Those for me?

YES.  I can generate SQL as your ETL of preference.  I can even generate Stored Procedure code if that’s what you want.

Contact me today for more information.

#datavault #RapidACE – v2 – About RapidXREF

Retweet

This is a descriptive entry that details the behavior of RapidXREF.  What it is, how it works.  Keep in mind this function is ONLY available in RapidACE v2!  RapidXREF (as described before) is a function that provides cross-reference information.

Overview of RapidXREF

This particular piece reads in: multiple source data models, 1 target data model, and a sparsely populated cross-reference spreadsheet (if you have it), and produces or builds a brand new fully-qualified cross-reference that maps every source table to every target table in a matter of seconds.

By sparsely populated, what I mean is: you can feed it a) no xref at all, and it will do a best guess, b) business keys to HUBS only and it will figure out the rest with 89% to 98% accuracy.  This makes the job of building these dreaded cross-reference spread-sheets that much easier!

Yes, it reads xlsx formats, and writes xls formats directly.

In essence, here is an image that might help with the visualization of its functionality:

Required Inputs:

  • Source Data Model
  • Target Data Model
  • Job Files / Configuration Files
  • ** Optional Input: Sparsely defined Cross Reference

The source and target data model no longer require any sort of primary or foreign key declaration for this process.  Nor do they have to be any particular data model type.  They can be star schemas, staging areas, 3rd normal form-ish models, cobol copybooks, data vaults, and any other type of model you wish to import.

The cross-reference (if provided) gives the engine a guide to follow for some of the more difficult field lineage pieces that it might not otherwise guess. Remember, the cross reference or cross-walk is *required* by 98% of all data integration projects that exist.  The cross-reference generally (usually) serves as documentation for the build out of the ETL code that is to follow.

The purpose of this function:

As stated previously, the purpose of this function is to generate the cross-reference document.  At first glance you might think – well that’s easy, just match field to field by name, and you’ve got it.

Unfortunately it isn’t quite that simple in the real world.  The new v2 engine takes many things in to account for matching including (but not limited to)

  • Field names themselves
  • abbreviations (short and long names)
  • naming conventions (prefixes and suffixes)
  • camel case, underscores, spaces, and dashes
  • and a basic ontology of business terminology (including groupings of terms)

It uses an advanced algorithm that provides a lot more intelligence around the way it scores matches, and potential matches – to come up with matched tables (from source to target), along with matched fields.  The algorithm goes way beyond just simple field name matching.

The algorithm is the secret sauce here, and the best way to understand it is to watch a demonstration of it, which I will post soon in video format.

What does it take to BUILD the XREF by hand?

Well, the typical process to build a cross-reference by hand for the first pass, can easily take a single person 2 to 3 months for 100 to 250 tables.  Then, there’s bound to be errors that need removing, so add another 3 or 4 weeks to that for a “first-cut” release.

With RapidACE V2, it would take maybe 3 days to a week to build the up-front configuration files for import, and setup some basic terminology.  Then it would take all of 10 minutes for the tool to generate the first pass output.

In other words, you can easily shave off months from your project just by using this kind of a tool.

Do I need to modify what’s produced?

Yes. RapidACE v2 is over-zealous, and sometimes matches things that “are matches” but are fields that you don’t want matched.  For instance, suppose you have 3 fields in every source and every target like: “update user, update timestamp, update code” – and these are system fields.  This function would match all 3 of these fields in all the source tables to all 3 of these fields in each of the target tables.

But there’s a configuration file for that – you can enter the “fields to ignore” during matching, and the issue is fixed.

However, what the tool produces is either 80% to 90% of what you want, or it’s 120% of what you want, and you need to augment it either way.  Adjust it for simple business rules, delete the cross-mappings that you don’t want, and you can go on your way.

So what happens if we change the Data Models?

EASY!  You push it back through!  You can actually take the modified or generated cross-reference, and set it as the sparse input (even if it’s already fully qualified), and ask the engine to synchronize to the source model, or synchronize to the target model.

Again, saving you hours and hours of trying to capture the changes by hand, while retaining the modifications you’ve already put in to the cross-reference.

What are the benefits again?

Well, generally in this era of data warehousing and data integration, you probably want or need ETL code generation. Right?

In order to generate ETL code (or SQL code), you must have a source to target mapping done somewhere.

This function of the RapidACE v2 tool does this for you.  Produces the source-to-target mapping in Excel!  For you to modify by hand, manually manipulate, then push in to your code-generation system downstream.

The benefits include:

  • Reduced errors in ETL code generation
  • Reduced time-line for implementation of data integration projects
  • Increased productivity of your EDW / Integration Team
  • Generated Documentation that can be Manually adjusted (as necessary by business)
  • Automated adjustment / synchronization of the cross-references when the data models change!!

Again, the purpose of this function is to assist with the ETL code generation, and to provide a huge time savings to your data integration team.

So how do I get the tool? work with the tool?

Simple, you hire me on a contract for a fixed price bid you can get a one-time generation or synchronization or both.

Contact me for more information. (use the contact us form here on the site)

#datavault #rapidace – v2 Makes Strides

Retweet

Hello everyone.  RapidACE v2 has made tremendous strides forward in the past year.  It is coming along quite nicely in fact.  This post will introduce you to v2 major functions, and in a few of the new pages I will post – I will provide an overview of those new functions.

Please note: use the contact us form instead of trying to make comments, because comments are closed.

The idea and hope is: that RapidACE v2 will be made available in a SaaS format for members who are students at http://LearnDataVault.com/training classes.  Right now, we don’t have a membership area setup – just individual classes that you can purchase.  But in the future, membership will be an option, and hopefully RapidACE v2 will also be available as a service on that site.

In the mean time, you can hire me as a consultant to use RapidACE v2 on your projects, to help you speed it up.  Unfortunately RapidACE v2 cannot be purchased as a software package – because of the following reasons:

  • I’m not a software shop
  • I can’t support the customers who “would buy if they could buy”
  • there’s not enough functionality there without me behind the code to make it worthwhile for you to purchase it
  • it’s a complicated tool, requiring a ton of preparation to make it work properly

Ok, now that this is out of the way, let me also say this:

  • RapidACE V2 will never be open-sourced
  • RapidACE v2 is a complete re-write of the internal code
  • RapidACE v2 is 70x more powerful than RapidACE v1
  • YOU DO NOT need to use, or be using a Data Vault model to use RapidACE V2!!

Ok, so now the major functions…  Please note, at each step, you can edit the model and the cross-reference that are produced outside of RapidACE using your own data modeling tool set, then push the changes back in to either the same or another function.

RapidXREF

This particular piece reads in: multiple source data models, 1 target data model, and a sparsely populated cross-reference spreadsheet (if you have it), and produces or builds a brand new fully-qualified cross-reference that maps every source table to every target table in a matter of seconds.

By sparsely populated, what I mean is: you can feed it a) no xref at all, and it will do a best guess, b) business keys to HUBS only and it will figure out the rest with 89% to 98% accuracy.  This makes the job of building these dreaded cross-reference spread-sheets that much easier!

Yes, it reads xlsx formats, and writes xls formats directly.

RapidMerge

This function basically takes one or more source models including cobol copybooks and will merge them in to a single data model.  It will also allow you to set parameters which govern the merging of super and subset tables, or like structures together.  It also normalizes the copybooks formats.

This function also reads a simple ontology of business terms and abbreviations to allow it to detect naming convention differences across multiple models.  It uses this ontology of terms to connect and identify potentially similar table structures.

This function will also consolidate the data model, and apply reduction techniques and relationship folding techniques in an attempt to clean up the model.

As a result of the merge, it generates a cross-reference and a new data model which you can then take outside the tool, make modifications, and push back in to the tool – either RapidGEN, or RapidModel as a next step.

For instance, if you take a disparate set of star schema models, and push them through merge, you would end up with a “reduced” star schema model (if reduction is possible without changing grain).  Yes you can take disparate star schema models, push them through this function in an attempt to end up with an enterprise single consolidated star schema model.

By the way, IBM has a product for this, and I think (if I’m not mistaken) they charge over $100,000 USD for this product.

RapidModel

This function does what you think it does.  It forward engineers data models.  It will read in one source data model and then builds new integrated models: staging models, data vault models, star schema models, exploration mart models, and master data models.

It also produces or generates the cross-reference for you.

You can then take both the cross-reference and the data models, edit them outside RapidACE in your standard tool sets, and then push the edits back in to RapidGEN functionality.

This function no longer relies on Primary or foreign key declarations alone!  IF they are defined, it helps, but this function relies now on business key declarations along with sparsely declared cross-references to do it’s job.  It will also search for possible relationships that previously are undeclared.

RapidTest

This function produces SQL test cases based on a customizable template, that check a) the structural integrity of the table, b) the data set in the table: for instance, balancing staging tables to Data Vault tables, or raw dimensions and raw facts.

This is merely a time saver for assisting with unit testing of your ETL routines.

RapidData

This function produces very small amounts of test data for your model.  Quite frankly, this is not the best function nor the best use of RapidACE v2.  If you need LARGE scale test data, you should really look at the RowGen tool from CoSort (I.R.I. and company), that tool is phenomenal.

This tool is limited, but will produce up to 10,000 rows per table for testing.  It’s especially helpful for “loading the staging area”, then running the unit tests generated by RapidTest after running the ETL to load the Data Vault.

RapidGEN

This function is what you’d expect it to be, yes, it’s a code generator. This function reads source data model, target data model, a cross-reference, job files, configuration files, and a template, and then produces code.  Yes, this is template driven.  No the templates are not public today.  Heck, the software isn’t even available except through me.

Today it generates loading code from relational source to relational staging areas, and from relational staging areas to Data Vault structures.  In the future it will produce additional code for staging areas to Star Schemas, and Data Vault to Star Schemas as well.

Today, it does produce Informatica v9.1 XML files with shared sources, targets, transformations – and shortcuts, mappings, maplets, sessions, worklets, and workflows ready to run.

Today it does produce stored procedure code for loading staging areas to a Data Vault for SQLServer 2008 R2.

In the future it will produce other things, like SSIS DTSX packages, and Pentaho Kettle XML and so on.

Conclusion:

Well that’s it folks, I hope you enjoyed this overview.  Soon I will dive a bit deeper and show you the “process workflow”, as well as some diagrams that explain the working pieces.  In the future I also hope to post a few videos – BUT rapidACE v2 has no GUI it is command line driven for now, so that it’s easier to run it on the web as a service.

#rapidace v2 now auto-builds cross-reference data

Retweet

Yes, I’ve managed to code a very sophisticated structural matching algorithm that uses ontologies for input, a source and target DDL set of structures, and even (if you already have it) a partial cross-reference document.

RapidACE v2 is very smart in this manner.  I think IBM has a tool they sell for $150k or more that “builds cross-reference documents”.  But RapidACE v2 goes a few steps further.  The matching algorithm is unique, special – and will use a number of different components to determine a matching score!

Yes, it goes way beyond v1 – RapidACE v1 stuck to primary and foreign key declarations, RapidACE v2 actually uses sophisticated text parsing algorithms, along with naming conventions, and defined ontologies as guides.  Of course, for the really tough problems where no possible match can be found – you can also pre-define a sparse XREF (source to target) for that specific match; and voila! RapidACE v2 will take that in to account.

It’s a very cool process, the end result is a matching score – you can change the output threshold up or down (out of 100%) to determine what gets produced to the Excel cross-reference.  That’s right, RapidACE v2 actually READS and WRITES directly to Excel spreadsheets.

The other cool thing is: once you have “generated” your cross-reference, you can edit it in Excel – tighten it up, change it, modify it, then push it back in to RapidACE v2 in order to generate accurate ETL code for SSIS, SQLServer, or Informatica v9.x

See you on the flip side,
Dan Linstedt

RapidACE v2 generates Automated Test Cases for Data Vault

Retweet

I’ve been hard at work on a new On-line class (soon to be released).  I’ve also added some templates to RapidACE v2 which generate automated DATA tests!!

Yep, that’s right, automated data testing against ANY Data Vault model that you can build.  The tests that are generated are built to test the standards, and are setup to check the data sets in the database – ensuring the integrity of your ETL routines!!!

No, these tests do not test “did I map the right field in the stage to the right field in the Data Vault” – what they do test are things like: Business Key uniqueness, matching composite relationships, and more.

The data tests in the Data Vault model will test:

  • Hubs
  • Links
  • Satellites
  • Hierarchical Links
  • Same-As Links
  • Transaction Links

The data tests that I generate can be run ANY TIME there is a change to the ETL layers, or any time a new systems’ ETL is built to load data to the Data Vault.

The On-line course will teach you:

  • How to test
  • How the tests work
  • What pass and fail mean for each test
  • How to record the results
  • Definition of each test (SQL and PseudoCode)

For now, I’m generating ANSI-SQL, by the end of next week I’ll be generating Informatica v9 Maps, Sessions, Workflows to run the data tests.  When run in Informatica, it will be easy to see when tests fail – as they will fail the workflows.

IF YOU ARE INTERESTED IN:

  • Having me generate these data test scripts for your Data Vault
  • OR: having me generate these data tests in Informatica for your Data Vault
  • OR: taking the up-coming on-line class

THEN Please, Contact me today.

Cheers,
Dan Linstedt

PS: This class will be made available at: http://LearnDataVault.com

RapidACE v2 Generates Test Data and More!

Retweet

RapidACE v2 is now generating raw test data based on the Data Model.  It also generates Informatica ETL mappings for Stage to Vault, including maplets, shortcuts, reusable sessions, reusable transformations, and so on.  There are a whole host of generation options embedded in RapidACE v2 now that I have the template engine in working in full force.

Test data…

RapidACE is not the fastest, nor the most robust test data generation tool on the market.  A really good tool for generating huge amounts of test data is: RowGen from IRI, the CoSort Company.  RowGen is written in Assembly language and C++, and is threaded.  RowGen (with the right scripts) can generate MASSIVE amounts of test data very very quickly.  I really like the tool, and one of the things that I will do shortly is generate RowGen Scripts from RapidACE v2, so you have options.

For now, RapidACE v2 generates raw test data based on the DDL.  The test data does not pay any attention to foreign key constrains, so it’s best applied in generating test data for loading to the staging area.  However, it reads DDL structures, and a “row-count” configuration that tells it how many rows to generate.   If you are in a pinch and need some rows (10 rows to 1 Million rows) of test data for staging loads, then RapidACE v2 is the engine for you.  If you are looking for serious enterprise test data generation, then I’d highly recommend RowGen from CoSort (IRI routines).

Of course, you could always use the test data generation component built in to Pentaho to accomplish the task, or I could show you how to generate test data with the Java Transform directly inside Informatica.

Either way, it’s a great basic place to start, if you’re in a pinch.

Informatica PowerCenter Maps, Sessions, Workflows

RapidACE v2, also generates FULL BLOWN INFORMATICA POWERCENTER v9 mappings, shortcuts, sources, targets, maplets, transformations, sessions, worklets, and workflows.  If you can “draw it” in Informatica, then I can convert it to a template and generate it for you against specific “table classifications” like Links, Sats, Staging, Hubs, Dimensions, Facts, etc…   And NO… you DON’T have to be using Data Vault to use RapidACE v2!!!

Cheers mates,
Dan Linstedt
PS: anyone with questions should feel free to contact me (use the contact form here).

Code for v1 is uploaded to SVN

Retweet

Hello everyone, I’ve uploaded  the source code for V1 to: http://sourceforge.net/projects/rapidaceos/  It’s been setup in the SVN BROWSE section rather than the FILES list, this way I can update the code as necessary.  The BSD license is embedded in each JAVA file (except for those that I don’t own).  I’ve added the Wilmascope files to the build because of the fact I added a “freeze graph” command to stop the 3D graph from moving.  Other than that, the Wilmascope code remains unchanged.

Soon, I will have a downloadable copy available for you here on this web site (pre-compiled) with the QuickTable code extensions, but it will be made available for those that register (for free) with this site.

I used Eclipse HELIOS release with all the latest updates.  I’ve also installed Java3D 64 bit, and have a more than capable machine running win 7 64 bit, Java 64 bit, and of course compiling everything on 64 bit.

Cheers,

Dan Linstedt

Source is coming soon!

Retweet

You may be wondering why I’ve not yet posted the source code?  It’s because I have to clean it up first.  Remove the QuickTable components (which I purchased a license to, but am not currently allowed to distribute), and reset the Wilmascope code in to a separate JAR file, although Wilmascope is LGPL.  I also want to make absolutely certain that when I do release the code, and you download it – that it will actually compile for you.

I’m upgrading to the latest release of Eclipse and all the libraries that go with it.

Anyhow, it’s 100% java, and 100% swing, so you shouldn’t have any trouble with it.   In the end, there may be features I actually have to “remove” like editing tables, editing datatypes, and so on – features that use QuickTable, except to provide them in a pre-compiled Binary (which I will have to obfuscate, again because of licensing issues).  But let’s hope that doesn’t happen.  I am seeking the blessing of the author of QuickTable to release it as a JAR with my compiled app.  We’ll see how well that goes.

The source is coming soon, and it will only be the V1 product that I release to the code base.  You should consider v1 code “frozen” as well, as I am well under-way with V2, but not ready yet to release this…

Hope this helps,
Dan L

RapidACE v1.6.3 Production Walk-Through

Retweet

Here is a full video of the 1.6.3 Production Software.

As you can see, it’s a work in progress.  V1 (as mentioned several times) DOES NOT generate ETL code…  the hooks are there, and some of the templates are there, BUT – the problems lie in the internal engine core with the cross-references.  There are too many core issues (buried in the design) to solve the code-generation problems.  Code Generation is the FIRST thing I solved in v2.