#rapidace v2 now auto-builds cross-reference data

Yes, I’ve managed to code a very sophisticated structural matching algorithm that uses ontologies for input, a source and target DDL set of structures, and even (if you already have it) a partial cross-reference document.

RapidACE v2 is very smart in this manner.  I think IBM has a tool they sell for $150k or more that “builds cross-reference documents”.  But RapidACE v2 goes a few steps further.  The matching algorithm is unique, special – and will use a number of different components to determine a matching score!

Yes, it goes way beyond v1 – RapidACE v1 stuck to primary and foreign key declarations, RapidACE v2 actually uses sophisticated text parsing algorithms, along with naming conventions, and defined ontologies as guides.  Of course, for the really tough problems where no possible match can be found – you can also pre-define a sparse XREF (source to target) for that specific match; and voila! RapidACE v2 will take that in to account.

It’s a very cool process, the end result is a matching score – you can change the output threshold up or down (out of 100%) to determine what gets produced to the Excel cross-reference.  That’s right, RapidACE v2 actually READS and WRITES directly to Excel spreadsheets.

The other cool thing is: once you have “generated” your cross-reference, you can edit it in Excel – tighten it up, change it, modify it, then push it back in to RapidACE v2 in order to generate accurate ETL code for SSIS, SQLServer, or Informatica v9.x

See you on the flip side,
Dan Linstedt

RapidACE v2 generates Automated Test Cases for Data Vault

I’ve been hard at work on a new On-line class (soon to be released).  I’ve also added some templates to RapidACE v2 which generate automated DATA tests!!

Yep, that’s right, automated data testing against ANY Data Vault model that you can build.  The tests that are generated are built to test the standards, and are setup to check the data sets in the database – ensuring the integrity of your ETL routines!!!

No, these tests do not test “did I map the right field in the stage to the right field in the Data Vault” – what they do test are things like: Business Key uniqueness, matching composite relationships, and more.

The data tests in the Data Vault model will test:

  • Hubs
  • Links
  • Satellites
  • Hierarchical Links
  • Same-As Links
  • Transaction Links

The data tests that I generate can be run ANY TIME there is a change to the ETL layers, or any time a new systems’ ETL is built to load data to the Data Vault.

The On-line course will teach you:

  • How to test
  • How the tests work
  • What pass and fail mean for each test
  • How to record the results
  • Definition of each test (SQL and PseudoCode)

For now, I’m generating ANSI-SQL, by the end of next week I’ll be generating Informatica v9 Maps, Sessions, Workflows to run the data tests.  When run in Informatica, it will be easy to see when tests fail – as they will fail the workflows.

IF YOU ARE INTERESTED IN:

  • Having me generate these data test scripts for your Data Vault
  • OR: having me generate these data tests in Informatica for your Data Vault
  • OR: taking the up-coming on-line class

THEN Please, Contact me today.

Cheers,
Dan Linstedt

PS: This class will be made available at: http://LearnDataVault.com

RapidACE v2 Generates Test Data and More!

RapidACE v2 is now generating raw test data based on the Data Model.  It also generates Informatica ETL mappings for Stage to Vault, including maplets, shortcuts, reusable sessions, reusable transformations, and so on.  There are a whole host of generation options embedded in RapidACE v2 now that I have the template engine in working in full force.

Test data…

RapidACE is not the fastest, nor the most robust test data generation tool on the market.  A really good tool for generating huge amounts of test data is: RowGen from IRI, the CoSort Company.  RowGen is written in Assembly language and C++, and is threaded.  RowGen (with the right scripts) can generate MASSIVE amounts of test data very very quickly.  I really like the tool, and one of the things that I will do shortly is generate RowGen Scripts from RapidACE v2, so you have options.

For now, RapidACE v2 generates raw test data based on the DDL.  The test data does not pay any attention to foreign key constrains, so it’s best applied in generating test data for loading to the staging area.  However, it reads DDL structures, and a “row-count” configuration that tells it how many rows to generate.   If you are in a pinch and need some rows (10 rows to 1 Million rows) of test data for staging loads, then RapidACE v2 is the engine for you.  If you are looking for serious enterprise test data generation, then I’d highly recommend RowGen from CoSort (IRI routines).

Of course, you could always use the test data generation component built in to Pentaho to accomplish the task, or I could show you how to generate test data with the Java Transform directly inside Informatica.

Either way, it’s a great basic place to start, if you’re in a pinch.

Informatica PowerCenter Maps, Sessions, Workflows

RapidACE v2, also generates FULL BLOWN INFORMATICA POWERCENTER v9 mappings, shortcuts, sources, targets, maplets, transformations, sessions, worklets, and workflows.  If you can “draw it” in Informatica, then I can convert it to a template and generate it for you against specific “table classifications” like Links, Sats, Staging, Hubs, Dimensions, Facts, etc…   And NO… you DON’T have to be using Data Vault to use RapidACE v2!!!

Cheers mates,
Dan Linstedt
PS: anyone with questions should feel free to contact me (use the contact form here).

Code for v1 is uploaded to SVN

Hello everyone, I’ve uploaded  the source code for V1 to: http://sourceforge.net/projects/rapidaceos/  It’s been setup in the SVN BROWSE section rather than the FILES list, this way I can update the code as necessary.  The BSD license is embedded in each JAVA file (except for those that I don’t own).  I’ve added the Wilmascope files to the build because of the fact I added a “freeze graph” command to stop the 3D graph from moving.  Other than that, the Wilmascope code remains unchanged.

Soon, I will have a downloadable copy available for you here on this web site (pre-compiled) with the QuickTable code extensions, but it will be made available for those that register (for free) with this site.

I used Eclipse HELIOS release with all the latest updates.  I’ve also installed Java3D 64 bit, and have a more than capable machine running win 7 64 bit, Java 64 bit, and of course compiling everything on 64 bit.

Cheers,

Dan Linstedt

Source is coming soon!

You may be wondering why I’ve not yet posted the source code?  It’s because I have to clean it up first.  Remove the QuickTable components (which I purchased a license to, but am not currently allowed to distribute), and reset the Wilmascope code in to a separate JAR file, although Wilmascope is LGPL.  I also want to make absolutely certain that when I do release the code, and you download it – that it will actually compile for you.

I’m upgrading to the latest release of Eclipse and all the libraries that go with it.

Anyhow, it’s 100% java, and 100% swing, so you shouldn’t have any trouble with it.   In the end, there may be features I actually have to “remove” like editing tables, editing datatypes, and so on – features that use QuickTable, except to provide them in a pre-compiled Binary (which I will have to obfuscate, again because of licensing issues).  But let’s hope that doesn’t happen.  I am seeking the blessing of the author of QuickTable to release it as a JAR with my compiled app.  We’ll see how well that goes.

The source is coming soon, and it will only be the V1 product that I release to the code base.  You should consider v1 code “frozen” as well, as I am well under-way with V2, but not ready yet to release this…

Hope this helps,
Dan L

RapidACE v1.6.3 Production Walk-Through

Here is a full video of the 1.6.3 Production Software.

As you can see, it’s a work in progress.  V1 (as mentioned several times) DOES NOT generate ETL code…  the hooks are there, and some of the templates are there, BUT – the problems lie in the internal engine core with the cross-references.  There are too many core issues (buried in the design) to solve the code-generation problems.  Code Generation is the FIRST thing I solved in v2.

RapidACE Business Ovierview

What is RapidACE from a business perspective?  In this post, I will try to shed some basic light on the answer to this question.

RapidACE is all about enabling your IT team to create Enterprise Data Warehouse solutions faster, better, and cheaper – with standardization in mind.  It’s not just about the Data Vault Model.  You can use RapidACE even if you aren’t using the Data Vault Model.   RapidACE is an enablement tool, it’s not a silver bullet.  It’s not a data warehouse in a box.  It doesn’t produce a data warehouse in a box either.  It’s an open architecture tool – that allows you to create base-line data warehousing data models (architectures) in rapid fashion.  It can help reduce a 6 month task to 4 weeks (depending on your teams’ level of knowledge and understanding).  Obviously, it is a case tool; and like any case-tool, if you put “garbage data models in”, you will get “garbage data models out”.  That’s the nature of case-tools!

So, the more work you put in up-front, the better off you will be in the automated production side of the house.

RapidACE processing looks like this:

 

But it takes MORE inputs than this!  It takes configuration files, and TEMPLATES (pre-written scripting code) to do it’s full work.  However, the power is still there for full iteration.  RapidACE can read DDL text files (data model structure declaration files) from MySQL, Sybase, ODBC, Oracle, SQLServer, Teradata and DB2 UDB.  It can also read 3rd normal-form-ish (normalized model formats), Data Vault class models, staging area models, star schema models, master data models, and exploration mart-like models.

“what-if” analysis with your architecture?

YES!  The beauty of this process, is because it’s open-architecture, you can forward engineer and generate a solution RAPIDLY (what used to take months, now takes days), then if you don’t like the result, you can MODIFY the input or output, and re-generate!  Leaving you valuable time to get your architecture right.


RapidACE SourceForge Project

The RapidACE SourceForge project is at: https://sourceforge.net/projects/rapidaceos/  I’ve not uploaded any files yet as I’m still working on incorporating the Open-Source license agreement to the code base.  Soon, very soon, I will upload Version 1.6 source (with a GUI), and v2 (command Line Only).  A short review of the versions is below:

Version 1.6 – Production Edition

This version is a working edition containing the following features:

  • GUI
  • Configuration Files
  • DDL Parsing
  • Merge & Consolidate
  • 2D and 3D visualization (* note: 3D requires a decent Graphics Processing Chip (GPU))  Also requires VRML processing capabilities
  • Forward Engineering: From 3NF to Stage, To Data Vault, To Star Schema, To Master Data Model, to Exploration Marts
  • Generation of Basic Cross-Reference (Source To Target Mapping)
  • Conversion of DDL from one database type to another.
  • IMPORT of Cobol Copybooks

Outstanding / known issues:

  • Problems with the DDL parser “hanging” with quotes, and complex MySQL DDL, especially comment blocks.  If you want the DDL to parse properly, try to clean it up first.  Remove all comment blocks, and remove all “quoted” field names, and the like.  For SQLServer DDL – remove “brackets and spaces” from within field and table names.  Remove all Stored Procedure and Function “code” from the DDL prior to importing.  And most importantly, BREAK ANY CYCLICAL FOREIGN KEY REFERENCES…  If Users table references Employees and Employees References Users table, then this is a cyclical foreign key.  Remove one of the FK declarations.  NOTE: change your DDL text file to ASCII.  This parser does NOT understand Unicode Text Files.
  • Code Generation Templates – missing, and/or too difficult to complete.  The cross-reference internal representation is a bit messed up.  This is why V1 code generation never really “got off the ground”
  • Forward Engineering Control – there are a lot of things the “3NF to Data Vault” forward generation engine does wrong.  Use the output only as a guide for future generation.  NOTE: if YOU clean up and merge your source DDL First, and provide basic Foreign Keys (across disparate data models) before reading them in to RapidACE v1, then you can end up with a fairly decent result.  Otherwise, forward engineering can sometimes provide you with “garbage”.
  • NOTE: the Merging and Consolidation engine are based on TABLE and FIELD NAMING CONVENTIONS, and a bit of matching logic – statistically driven.  SO, if your data models are disparate, and the naming conventions don’t match, then the models will NOT be integrated without you providing a foreign key designation (in the DDL only) to tie the models together.
  • There are about 125 known bugs and issues with this current version, too many to mention here.
  • There are some code pieces that I was playing with around RDBMS (SQLJDBC) access, but those pieces were never fully implemented.  V1 uses 100% in-RAM processing.  Therefore, it is recommended that you “RUN” RapidACE v1 with as much RAM as you can give it (heap & stack settings).  I’ve been successful running a single DDL construct with over 1000 source tables through the entire process in a 32 bit Java environment.  That means it’s worked for over 6000 tables (all in RAM).

Bottom Line?  V1.6 was good, but the Core needed to be re-written from the ground up.

JAR dependencies:

  • FreeMarker
  • FMPP (freeMarker Pre Processor)
  • resources.jar
  • rt.jar
  • jsse.jar
  • jce.jar
  • charsets.jar
  • dnsns.jar
  • dns_sd.jar
  • j3dcore.jar
  • j3dutils.jar
  • localdata.jar
  • sunjce_provider.jar
  • vecmath.jar
  • Jama-1.0.2.jar
  • concurrent-1.3.4.jar
  • jung-3d-2.0-beta1.jar
  • jung-algorithms-2.0-beta1.jar
  • jung-io-2.0-beta1.jar
  • jung-jai-2.0.beta1.jar
  • jung-graph-impl-2.0-beta1.jar
  • jung-visualtization-2.0-beta1.jar
  • resolver.jar
  • fmpp.jar
  • freemarker.jar
  • bsh.jar
  • oro.jar
  • j3d-org-java3d-all.jar
  • j3d-vrml97.jar
  • collections-generic-4.01.jar
  • commons-collections-3.2.jar
  • commons-configuration-1.4.jar
  • xml-apis.jar
  • colt-1.2.0jar
  • sqljdbc.jar
  • glutil.jar
  • commons-codec-1.3.jar
  • commons-httpclient-3.1.jar
  • commons-logging-1.1.1.jar

Some V1 Screen Shots:


V2 – Alpha Edition

This version is a COMMAND LINE driven engine, it is a complete rewrite of the entire core engine.  It started with the parser, and the code-generation system.  NO CODE FROM V1 is USED IN V2!!  Other than the “libraries” that are pulled in.

  • Command Line Driven
  • XML Configuration Files (literally over 150+ options to change / manage / override)
  • Informatica v9 ETL – XML code Generation, including workflows, sessions, worklets, mappings, etc..  Source to Stage, and Stage to Data Vault – these are FreeMarker Templates
  • Test Data Generation – both Informatica Mappings (that generate Test Data using the Java Object), and Raw Test Data Files themselves.  Again, these are FreeMarker Templates
  • DDL Parser – faster, 100% re-written, leaner, meaner, and no longer “hangs” on dirty DDL.
  • Forward Engineering: Currently, 3NF to Stage Model is the only available component.  I am working on 3NF to Data Vault engineering as we speak.

Note: there is a GUI, but it’s not been fully implemented nor built.   I would NOT suggest turning it on.  I will turn it on when I’m ready to do so.

** NEW FEATURES TO COME IN V2:  Model merging and consolidation WILL eventually incorporate Taxonomies.  Hopefully when I get to that point, I will integrate Ontology engines and RDF/OWL import / export.  I will also be using Taxonomies/Ontologies as a guide to the model merging and consolidation efforts.  This will be a crowning jewel of the feature sets when I get there.   – SUPPORT ME and WHO KNOWS…  If you support this project, I might just get more features done.

DO NOT mix V1 with V2 (settings, files, etc…) the ONLY things you can mix are the DDL files.  The Two engines are completely different builds and 100% different source code.

By the way.  If you are NOT familiar with FreeMarker Scripting Language, I suggest you study it.  It’s EXTREMELY POWERFUL, and yes, the templates I’m producing (for the most part) will be made available to the public domain.

If you’d like to hire my services, I will gladly create custom templates for you – these can either be contributed to the open-source project, or owned by you as Intellectual Property.  Contact me today for more information.

RapidACE base Generic Model

RapidACE software uses MVC (loosely).   In that, I have a generic “data holder” for all data elements.  This data containter is called ITEM.  It is not inherited, nor abstracted.  All properties are assigned dynamically, and therefore the code abstraction “of what you can do with the data” is endless, without modification to the core set.

Furthermore, any “abstraction” happens in the controller layers – again, without Inheritance of the physical object.  This makes RapidACE core engine infinitely scalable, and easy to alter/update.  The core element or data holder is exposed here.

It’s called: “ITEM”.

[java]

/**
*
*/
package model.generic;

import java.util.ArrayList;
import java.util.List;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import freemarker.template.SimpleNumber;
import freemarker.template.SimpleCollection;
import freemarker.template.SimpleHash;

/**
* @author Dan Linstedt
* (C) Dan Linstedt, 2011-2012, All Rights Reserved
* Written for RapidACE
*
* Purpose: to hold ALL data elements, and to be a recursive container
* for additional data elements.  This way, an Item can represent a FIELD
* or a CONSTRAINT, or a TABLE with fields, or anything else I want it
* to be.  All “attributes” are programmatically added as objects,
* so they are NEVER private, but can be manipulated at run-time.
*
*/
public class Item extends LinkedHashMap<String,Object> {

/**
*
*/
private static final long serialVersionUID = 1597929376529169667L;

private Item parentMap;
private String XMLTag = “”;
private boolean vHasChildren = false;

// hold a COPY of array pointers to children objects
// this allows freemarker access direct through recursive hash
// and allows Java easy “function” reference to children
private LinkedHashMap<String,Item> private_children;

// hold a COPY of array pointers to PROPERTIES types objects
// for the same reasons as CHILDREN are private
//private LinkedHashMap<String,Object> private_properties;

// these are variables that are INTERNAL ONLY
// they are NOT added to the children, therefore are
// hidden from the view of FreeMarker
private LinkedHashMap<String,Item> internals;

/**
*
*/
public Item() {
internalInit();
}

/**
* @param initialCapacity
*/
public Item(int initialCapacity) {
super(initialCapacity);
internalInit();
}

/**
* @param m
*/
@SuppressWarnings({ “unchecked”, “rawtypes” })
public Item(Map m) {
super(m);
internalInit();
}

/**
* @param initialCapacity
* @param loadFactor
*/
public Item(int initialCapacity, float loadFactor) {
super(initialCapacity, loadFactor);
internalInit();
}

/**
* @param initialCapacity
* @param loadFactor
* @param accessOrder
*/
public Item(int initialCapacity, float loadFactor, boolean accessOrder) {
super(initialCapacity, loadFactor, accessOrder);
internalInit();
}

private void internalInit() {
private_children   = new LinkedHashMap<String,Item>();
//        private_properties = new LinkedHashMap<String,Object>();
internals          = new LinkedHashMap<String,Item>();
hasChildren(false);
}

public void hasChildren(boolean setchildren) {
vHasChildren = setchildren;
}

public void XMLType(String xType) {
XMLTag = xType;
}

public String XMLType() {
return(XMLTag);
}

public void name(String name) {
this.put(“name”, name);
}

public String name() {
return(this.getProperty(“name”));
}

public void destroy() {
//        destroyLinkedMap(private_properties);
destroyLinkedMap(private_children);

ArrayList<Object> vals = new ArrayList<Object>();
for(Object child:this.values())
vals.add(child);

for(Object child:vals) {
this.remove(child);
if (child instanceof Item)
((Item)child).destroy();
}
vals.clear();
//        private_properties.clear();
private_children.clear();    // clear the COPY of the objects
internals.clear();            // YOU are responsible for destroying other objects
this.clear();
}

/**
* DEEP COPY OF MYSELF TO ANOTHER OBJECT
* @return
*/
public Item copy() {
Item newCopy = new Item();
for(String prop:this.keySet()) {
Object val = this.get(prop);
if (val instanceof Item)
newCopy.put(prop, ((Item) val).copy());
else
newCopy.put(prop, val);
}

return(newCopy);
}

public void setCDATA(String cdata) {
this.put(“#CDATA”, cdata);
}

public boolean ignoreInLoop(String key) {
return(key.toUpperCase().matches(“XMLATTRS|#CDATA|HASDUPLICATENAMES”));
}

@SuppressWarnings({ “unchecked”, “rawtypes” })
private void destroyLinkedMap(LinkedHashMap myMap) {
ArrayList<String> keyList = new ArrayList<String>();
Set<String> props = myMap.keySet();
for(String key:props)
keyList.add(key);

for(String key:keyList) {
Object val = (Object)myMap.get(key);
myMap.remove(key);
if (val instanceof Item)
((Item)val).destroy();
}
}

public void put(String key,String value) {
//        private_properties.put(key, value);
super.put(key, value);
}

public void put(String key, Integer value) {
SimpleNumber myval = new SimpleNumber(value);
//        private_properties.put(key, myval);
super.put(key, myval);
}

public void put(String key, Float value) {
SimpleNumber myval = new SimpleNumber(value);
//        private_properties.put(key, myval);
super.put(key, myval);
}

public void put(String key, Double value) {
SimpleNumber myval = new SimpleNumber(value);
//        private_properties.put(key, myval);
super.put(key, myval);
}

public void put(String key, Long value) {
SimpleNumber myval = new SimpleNumber(value);
//        private_properties.put(key, myval);
super.put(key, myval);
}

public void put(String key, boolean value) {
Boolean myval = new Boolean(value);
//        private_properties.put(key, myval);
super.put(key, myval);
}

public void put(String key, List<Object> value) {
super.put(key, new SimpleCollection(value));
}

public void put(String key, HashMap<Object,Object> value) {
super.put(key, new SimpleHash(value));
}

public void put(String key, Item child) {
super.put(key, child);
private_children.put(key, child);
child.setParent(this);
hasChildren(true);
}

public void putDontChangeParent(String key, Item child) {
super.put(key, child);
private_children.put(key, child);
hasChildren(true);
}

public void setParent(Item parent) {
parentMap = parent;
}

public Item createChild(String key) {
if (this.childContainsKey(key))
return(this.getChild(key));
Item child = new Item();
this.put(key,child);
child.XMLType(key);
return(child);
}

public Item createChildMap(String childName) {
Item newItem = createChild(childName);
newItem.XMLType(childName);
return(newItem);
}

public String getAttribute(String key) {
Item child = this.getChild(“XMLAttrs”);
if (child==null) return(“”);
return(child.getProperty(key));
}

/* ————————————–
*
* These are routines used for interfacing with this
* hash array
*
*/

public void putProperty(String key, String value) {
this.put(key, value);
}

public String removeProperty(String key) {
return((String)this.remove(key));
}

public String getProperty(String key) {
Object itm = this.get(key);
if (itm instanceof String)
return((String)itm);
if (itm instanceof SimpleNumber)
return(((SimpleNumber)itm).toString());
return(“”);
}

public String getString(String key) {
return(getProperty(key));
}

public Integer getInteger(String key) {
Object itm = this.get(key);
try {
Integer newInt = Integer.parseInt((String)itm.toString());
return(newInt);
} catch (Exception e) {
return(0);
}
}

public Item getParent() {
return(parentMap);
}

public void putChild(String key, Item value) {
this.put(key, value);
}

public boolean childContainsKey(String key) {
return(private_children.containsKey(key));
}

public Item getChild(String key) {
return(private_children.get(key));
}

public Item removeChild(String key) {
Item itm = private_children.remove(key.toUpperCase());
if (private_children.size()==0)
this.hasChildren(false);
this.remove(key.toUpperCase());
return(itm);
}

public LinkedHashMap<String,Item> children() {
return(private_children);
}

public boolean hasProperties() {
if (this.size()>=1)
return(true);
return(false);
}

public boolean hasChildren() {
if (private_children.size()>=1)
return(true);
return(false);
}

public LinkedHashMap<String,Object> properties() {
return(this);
}

/**
* Drill into a path (dot notation) to get a CHILD item
* @param path – String (dot notation path)
* @return Item – child item
*/
public Item getPath(String path) {
if (path.equals(“”)) return(null);
Item match = this;
String[] keys = path.split(“\\.”);
if (match.containsKey(keys[0])) {
match = match.getChild(keys[0]);
int count = 1;
while(count < keys.length) {
String key = keys[count];
if (match.childContainsKey(key))
match = match.getChild(key);
if (match == null) return(null);
count++;
}
}
return(match);
}

public Item getInternal(String key) {
return(internals.get(key));
}

public void putInternal(String key, Item ptr) {
internals.put(key, ptr);
}

}
[/java]