Migrate a Relational Database into Cassandra (Part IV – Northwind Import)

This article shows how to prepare and import a dataset expressed in Cassandra-friendly JSON into a Cassandra datastore using Cassandra’s “json2sstable” utility.

Before proceeding, you should understand my previous “Part 3” article on “Northwind Conversion”. This article imports the JSON dataset created in that article.

You should also have downloaded and installed either the .NET version or Mono version of the DivConq JSON command-line utilities, and should also have a complete JSON document from a conversion of the Northwind database export. (You can also start with the “JSONout7.txt” document from this archive.)

Cleaning the Data

So far everything we’ve done has simply moved data around.  This has led to a JSON structure that contains everything and then some from the original relational database.  We could import that, but from here on out we’ll treat this data more like a traditional data warehouse by only working with a subset of the original data.

To do the stripping, we can use a DivConq utility called “StripNodesFromJSON”.   The following batch snippet cuts out extra nodes (like “Shippers”) and tags (like “Phone”).

rem Let us turn this structure into something we can use in a data warehouse
rem Strip a lot of the extra tags out
rem x Get rid of the Shippers node
StripNodesFromJSON JSONout7.txt JSONout8.txt Shippers
rem x Get rid of extra nodes from the Employee node
StripNodesFromJSON JSONout8.txt JSONout9.txt PostalCode Photo Address ReportsTo HireDate HomePhone Notes BirthDate Extension
rem x Get rid of extra nodes from the Customer node
StripNodesFromJSON JSONout9.txt JSONout10.txt City Phone Region ContactTitle Address PostalCode Fax ContactName
rem x Get rid of extra nodes from the ItemEntry nodes
StripNodesFromJSON JSONout10.txt JSONout11.txt OrderID Product_UnitPrice Product_UnitsInStock Product_QuantityPerUnit Product_ReorderLevel Supplier_City Supplier_Region "Order Details_AutoID" Product_CategoryID Supplier_ContactTitle Supplier_ContactName Product_Discontinued Supplier_HomePage Supplier_PostalCode Supplier_Address Category_CategoryID  Category_Picture Category_Description Supplier_Fax Supplier_Phone
rem x Get rid of extra nodes from OrderInformation
StripNodesFromJSON JSONout11.txt JSONout12.txt OrderID ShipPostalCode ShipCountry CustomerID EmployeeID

If you do a directory listing on the intermediate files created in this batch file you should see that each one is smaller than the one before it.
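As a rough illustration of what this kind of stripping does, the sketch below removes named keys at any depth of a parsed JSON document. It is not the DivConq implementation: the function name strip_keys and the sample values are made up for this example, and it assumes node names are matched wherever they occur in the tree.

```python
import json


def strip_keys(node, unwanted):
    """Return a copy of `node` with every key listed in `unwanted` removed at any depth."""
    if isinstance(node, dict):
        return {
            key: strip_keys(value, unwanted)
            for key, value in node.items()
            if key not in unwanted
        }
    if isinstance(node, list):
        return [strip_keys(item, unwanted) for item in node]
    return node  # strings, numbers, booleans and null pass through unchanged


# Hypothetical miniature of the dataset; the values are placeholders.
sample = {
    "Orders": {
        "10778": {
            "Customer": {"CompanyName": "Example Co", "Phone": "555-0100"}
        }
    },
    "Shippers": {"1": {"CompanyName": "Example Shipping"}},
}

stripped = strip_keys(sample, {"Shippers", "Phone"})
print(json.dumps(stripped, indent=2))
```

Because whole subtrees disappear along with their keys, each pass can only make the document smaller, which matches the shrinking intermediate files.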

Cassandra’s JSON2SStable Format

If you’ve worked with Cassandra’s SStable2JSON utility, you’ve seen that the format Cassandra uses for its JSON datasets is not human-readable.

Cassandra’s SStable2JSON utility will export plain (no supercolumn) Column Families like this:

{
  "HotWheelsCar": [
    ["5072696365", "312e3439", 1278132336497000, false],
    ["53656374696f6e", "56656869636c6573", 1278132515996000, false]
  ],
  "GumDrop": [
    ["5072696365", "302e3235", 1278132306875000, false],
    ["53656374696f6e", "43616e6479", 1278132493790000, false]
  ]
}

…and will export supercolumn-filled Column Families like this:

{
  "ultralights": {
    "756c3233": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["7365617432", "392070656f706c65", 1283394499763000, false]
      ]
    }
  },
  "planes": {
    "706c616e65313436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773138", "372070656f706c65", 1283394371843000, false],
        ["726f773237", "322070656f706c65", 1283394387348000, false]
      ]
    },
    "706c616e65353436": {
      "deletedAt": -9223372036854775808,
      "subColumns": [
        ["726f773232", "332070656f706c65", 1283394349929000, false]
      ]
    }
  }
}

Several things are different from our JSON sets to date:

  • The data and supercolumn names are in hex rather than strings.  For example, instead of “Price”, you see “5072696365” in the JSON above. (Use this to try it yourself.)
  • There are extra strings, such as “deletedAt” and “false”.  Fortunately, it appears that these can be faked up.
  • Columns are filed under a “subColumns” node within each supercolumn entry.
  • JSON array structures are used in place of hierarchies.

…but it’s not impossible, it’s just different.
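The hex strings are just the bytes of the original text written out two hex digits per byte. A minimal Python sketch, assuming the names and values are UTF-8 (or plain ASCII) text; the helper names to_hex and from_hex are made up for this example:

```python
def to_hex(text):
    """Encode a string as the hex digits of its UTF-8 bytes."""
    return text.encode("utf-8").hex()


def from_hex(hex_digits):
    """Decode a hex string back into readable text."""
    return bytes.fromhex(hex_digits).decode("utf-8")


print(to_hex("Price"))               # 5072696365
print(from_hex("312e3439"))          # 1.49
print(from_hex("53656374696f6e"))    # Section
print(from_hex("56656869636c6573"))  # Vehicles
```

Decoding the first listing this way shows that “HotWheelsCar” has a “Price” of “1.49” and a “Section” of “Vehicles”.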

Creating JSON2SStable Format Files

Before we continue, we need to wrap our JSON datasets in one more node to represent the datastore – so far our top level has been column families, and the only remaining column family is now “Orders”.

Fortunately we can do this without a special utility: just a few lines of a batch file are needed to add a top-level “Northwind” node.

rem Add an extra wrapper for the name of the datastore
echo { "Northwind" : > JSONout12a.txt
type JSONout12.txt >> JSONout12a.txt
echo } >> JSONout12a.txt
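If you would rather confirm that the wrapped result is still valid JSON, the same step can be sketched in Python. This is an alternative to the batch lines above, not part of the DivConq utilities; wrap_in_node is a hypothetical helper name.

```python
import json


def wrap_in_node(document_text, name):
    """Parse JSON text and nest the whole document under one top-level key.

    json.loads raises an error if the input is not valid JSON. Note that it
    silently keeps only the last value when an object repeats a key.
    """
    return json.dumps({name: json.loads(document_text)}, indent=2)


# Tiny stand-in for the contents of JSONout12.txt.
print(wrap_in_node('{"Orders": {}}', "Northwind"))
```

To apply it to the real file, read “JSONout12.txt” into a string, pass it through the function, and write the result to “JSONout12a.txt”.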

Now we’re finally ready to use a DivConq utility to convert our human-readable JSON into the format needed by Cassandra’s JSON2SStable utility.  This part is easy.

rem Now convert wrapped dataset to json2sstable-ready
rem Cassandra array import format
PrepJSONForSSTableImport JSONout12a.txt JSONout13.txt

Now you should have a new, larger file filled with all the information Cassandra will need for its native import utility.
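To make the target format more concrete, here is a minimal sketch of the kind of transformation involved. It is not the PrepJSONForSSTableImport source: the function names are invented, the input shape (row key, then supercolumn, then column) is an assumption, and the “deletedAt” value and trailing false are simply copied from the export listings above.

```python
import json

# Value seen in every "deletedAt" field of the export listings above.
DELETED_AT = -9223372036854775808


def to_hex(text):
    """Hex-encode the UTF-8 bytes of a name or value."""
    return str(text).encode("utf-8").hex()


def prep_rows(rows, timestamp):
    """Convert {row_key: {supercolumn: {column: value}}} into the array layout.

    Row keys stay readable; supercolumn names, column names and values become
    hex, and each column turns into a [name, value, timestamp, false] array.
    """
    prepared = {}
    for row_key, supercolumns in rows.items():
        prepared[row_key] = {
            to_hex(sc_name): {
                "deletedAt": DELETED_AT,
                "subColumns": [
                    [to_hex(column), to_hex(value), timestamp, False]
                    for column, value in columns.items()
                ],
            }
            for sc_name, columns in supercolumns.items()
        }
    return prepared


# Readable form of the "ultralights" row from the supercolumn export above.
readable = {"ultralights": {"ul23": {"seat2": "9 people"}}}
print(json.dumps(prep_rows(readable, 1283394499763000), indent=2))
```

Running this reproduces the “ultralights” entry from the supercolumn listing, which is a handy check that the hex encoding and array layout are understood correctly. It also explains why the prepared file is larger: every character becomes two hex digits, and every column gains a timestamp and a flag.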

The Whole Export, Convert and Prep for Import Process

You may have noticed that the DivConq utilities ship with an “exportandimport.bat” file that performs all the steps covered so far.  Running this batch file should generate output like this.

C:\divconq\dotnet>exportandimport
22:38:39 Found expected organization in the "Orders" object.
22:38:39 Found expected organization in the "Order Details" object.
22:38:40 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:40 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:40 Found expected organization in the "Orders" object.
22:38:40 Found expected organization in the "Employees" object.
22:38:41 Completed OK.  Moved 830 children and found 0 orphans.
22:38:41 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:42 Found expected organization in the "Orders" object.
22:38:42 Found expected organization in the "Customers" object.
22:38:43 Completed OK.  Moved 830 children and found 0 orphans.
22:38:43 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:43 Found expected organization in the "Orders" object.
22:38:43 Found expected organization in the "Products" object.
22:38:45 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:45 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:46 Found expected organization in the "Orders" object.
22:38:46 Found expected organization in the "Suppliers" object.
22:38:48 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:48 WARNING: MergeAsName does not contain an [ID] or other macro.  This could lead to invalid JSON through duplicate keys in merged children!
22:38:50 Found expected organization in the "Orders" object.
22:38:50 Found expected organization in the "Categories" object.
22:38:52 Completed OK.  Moved 2155 children and found 0 orphans.
22:38:53 Found expected organization in the "Orders" object.
22:38:55 Completed OK.  Moved 11620 children and found 0 orphans.
22:38:57 Completed OK.  Deleted 1 nodes.
22:38:59 Completed OK.  Deleted 9130 nodes.
22:39:01 Completed OK.  Deleted 6640 nodes.
22:39:02 Completed OK.  Deleted 43930 nodes.
22:39:03 Completed OK.  Deleted 4980 nodes.
22:39:08 Completed OK.  Did 39140 nodes.

This batch file and any of its commands can, of course, be modified to taste or to work with other datasets.

Importing Into Cassandra

If you simply run Cassandra’s JSON2SStable command you’ll see some short usage information.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat
Missing required options: Kc
Usage: org.apache.cassandra.tools.SSTableImport -K keyspace -c column_family <json> <sstable>

…but please use the following procedure to properly import your JSON dataset.

First, shut down your Cassandra client and server (if started).  Then go into your Cassandra folder and open up your “conf\storage-conf.xml” file.  Add the following entry to this file and save.  (You can substitute your own name for “NorthwindOne” as long as you use it consistently below.)

<Keyspace Name="NorthwindOne">
  <ColumnFamily Name="Orders"
     CompareWith="UTF8Type"
     ColumnType="Super" CompareSubcolumnsWith="UTF8Type"
     />
   <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
   <ReplicationFactor>1</ReplicationFactor>
   <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

Once you’ve saved this file, start the Cassandra server again.  If your configuration changes were accepted, this will create a new, empty directory named after the keyspace in your Cassandra server’s data folder.

You should also fire up the Cassandra client to check that your new datastore is live.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system

Now, stop the Cassandra server again and shut down the Cassandra client again.  (The Cassandra client doesn’t respond well to the server going up and down.)

To properly invoke the JSON2SStable utility, use the following syntax, substituting the appropriate values and paths as necessary.

In the example below, “NorthwindOne” is the name of our keyspace and must match the value we saved into the “conf\storage-conf.xml” file above.  “Orders” is the name of the new column family we will be creating and inserting our native-formatted JSON into.  The path to the “JSONout13.txt” file is, of course, the file we’re importing.  Finally, the path to the “Orders-1-Data.db” file indicates which Cassandra data file we will create.  Note that this file does not yet exist, but the rest of the path (the folder structure) must already be in place.

C:\work\apache-cassandra-0.6.3>bin\json2sstable.bat -K NorthwindOne -c Orders C:\divconq\dotnet\JSONout13.txt C:\var\lib\cassandra\data\NorthwindOne\Orders-1-Data.db

If this works correctly, it will take a few seconds to silently import the data and will then silently return you to the command prompt.  If you see any other output from this command, you encountered an error.
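Since the utility says little when something is off, it can help to check the path requirements before running it. The sketch below is only an illustration: build_import_command is a hypothetical helper, and it uses nothing beyond the -K and -c options and the two positional paths shown in the usage text above.

```python
import tempfile
from pathlib import Path


def build_import_command(keyspace, column_family, json_path, sstable_path):
    """Check the import preconditions and return the command as an argument list."""
    json_file = Path(json_path)
    target = Path(sstable_path)
    if not json_file.is_file():
        raise FileNotFoundError(f"JSON input not found: {json_file}")
    if not target.parent.is_dir():
        raise FileNotFoundError(f"Data folder must already exist: {target.parent}")
    if target.exists():
        raise FileExistsError(f"Data file already exists: {target}")
    return [
        "bin\\json2sstable.bat",
        "-K", keyspace,
        "-c", column_family,
        str(json_file),
        str(target),
    ]


if __name__ == "__main__":
    # Demonstration in a throwaway folder so it runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "JSONout13.txt"
        source.write_text("{}")
        data_dir = Path(tmp) / "NorthwindOne"
        data_dir.mkdir()
        print(build_import_command(
            "NorthwindOne", "Orders", source, data_dir / "Orders-1-Data.db"))
```

The returned list could be handed to subprocess.run from the Cassandra folder, but running the command by hand as shown above works just as well.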

Another way to quickly confirm that data was imported successfully is to eyeball the Cassandra data directory.  This should now contain three new files: Orders-1-Data.db, Orders-1-Filter.db and Orders-1-Index.db.

If you see these files, go ahead and fire up your Cassandra server and client again.

Working With Your Imported Data

Finally, it’s time to view the live data on the live Cassandra server. Try these commands first.

C:\work\apache-cassandra-0.6.3>bin\cassandra-cli --host localhost
Starting Cassandra Client
Connected to: "Test Cluster" on localhost/9160
Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra> show keyspaces;
NorthwindOne
Keyspace1
system
cassandra> get NorthwindOne.Orders['10778']['OrderInformation']
=> (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=ShipVia, value=1, timestamp=1269842588093)
=> (column=ShipRegion, value=, timestamp=1269842588093)
=> (column=ShipName, value=Berglunds snabbk?, timestamp=1269842588093)
=> (column=ShipCity, value=Lule?, timestamp=1269842588093)
=> (column=ShipAddress, value=Berguvsv?gen  8, timestamp=1269842588093)
=> (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
=> (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
=> (column=Freight, value=6.7900, timestamp=1269842588093)
Returned 9 results.
cassandra> get NorthwindOne.Orders['10778']
=> (super_column=OrderInformation,
     (column=Freight, value=6.7900, timestamp=1269842588093)
     (column=OrderDate, value=12/16/1997 12:00:00 AM, timestamp=1269842588093)
     (column=RequiredDate, value=1/13/1998 12:00:00 AM, timestamp=1269842588093)
     (column=ShipAddress, value=Berguvsv?gen  8, timestamp=1269842588093)
     (column=ShipCity, value=Lule?, timestamp=1269842588093)
     (column=ShipName, value=Berglunds snabbk?, timestamp=1269842588093)
     (column=ShipRegion, value=, timestamp=1269842588093)
     (column=ShipVia, value=1, timestamp=1269842588093)
     (column=ShippedDate, value=12/24/1997 12:00:00 AM, timestamp=1269842588093)
)
=> (super_column=ItemEntry_1393,
     (column=Category_CategoryName, value=Seafood, timestamp=1269842588093)
     (column=Discount, value=0, timestamp=1269842588093)
     (column=ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductID, value=41, timestamp=1269842588093)
     (column=Product_ProductName, value=Jack's New England Clam Chowder, timestamp=1269842588093)
     (column=Product_SupplierID, value=19, timestamp=1269842588093)
     (column=Product_UnitsOnOrder, value=0, timestamp=1269842588093)
     (column=Quantity, value=10, timestamp=1269842588093)
     (column=Supplier_CompanyName, value=New England Seafood Cannery, timestamp=1269842588093)
     (column=Supplier_Country, value=USA, timestamp=1269842588093)
     (column=Supplier_SupplierID, value=19, timestamp=1269842588093)
     (column=UnitPrice, value=9.6500, timestamp=1269842588093))
=> (super_column=Employee,
     (column=Country, value=USA, timestamp=1269842588093)
     (column=FirstName, value=Janet, timestamp=1269842588093)
     (column=LastName, value=Leverling, timestamp=1269842588093)
     (column=Title, value=Sales Representative, timestamp=1269842588093)
     (column=TitleOfCourtesy, value=Ms., timestamp=1269842588093))
=> (super_column=Customer,
     (column=CompanyName, value=Berglunds snabbk?, timestamp=1269842588093)
     (column=Country, value=Sweden, timestamp=1269842588093))
Returned 4 results.

You can pick other Order IDs and supercolumn values (e.g., “Customer”, “Employee”, various “ItemEntry_” values) to view those values too.

Next Steps

At this point you have the tools and documentation to not only import the Microsoft Northwind Access database into Cassandra, but similar databases as well. This concludes the “Migrate a Relational Database into Cassandra” series of articles.

The next set of articles will describe how to build a working application on top of Cassandra.

Troubleshooting

If you encounter errors during import, feel free to shut down the server and wipe all the data files from the keyspace’s data folder.

Also look for entries like the following in the “C:\var\log\cassandra\system.log” file.  This specific instance came from importing extra data through this process, but what these errors are really telling you is that the “Trains-2-Data.*” files are useless and that the older “Trains-1-Data.*” files are still good.

 INFO [main] 2010-09-02 21:48:20,172 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-1-Data.db
 INFO [main] 2010-09-02 21:48:20,177 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db
ERROR [main] 2010-09-02 21:48:20,183 ColumnFamilyStore.java (line 182) Corrupt file C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db; skipped
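If the log is long, a few lines of Python can pull out just the skipped files. This sketch assumes the error lines look exactly like the one above (“Corrupt file … ; skipped”); corrupt_files is a hypothetical helper name, and other Cassandra versions may word the message differently.

```python
import re

# Matches lines such as: ERROR [main] ... Corrupt file <path>; skipped
CORRUPT_LINE = re.compile(r"^\s*ERROR\b.*\bCorrupt file (?P<path>.+?); skipped\s*$")


def corrupt_files(log_lines):
    """Return the paths of all data files the server reported as corrupt."""
    found = []
    for line in log_lines:
        match = CORRUPT_LINE.match(line)
        if match:
            found.append(match.group("path"))
    return found


sample_log = [
    r" INFO [main] 2010-09-02 21:48:20,172 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-1-Data.db",
    r" INFO [main] 2010-09-02 21:48:20,177 SSTableReader.java (line 120) Sampling index for C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db",
    r"ERROR [main] 2010-09-02 21:48:20,183 ColumnFamilyStore.java (line 182) Corrupt file C:\var\lib\cassandra\data\TransSchedTwo\Trains-2-Data.db; skipped",
]

for path in corrupt_files(sample_log):
    print(path)
```

For a real log, pass the open file object (for example, open(r"C:\var\log\cassandra\system.log")) instead of the sample list.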
