Why We Chose Working with KNIME

Knime review analytics data preparation sydney

As data practitioners or enthusiast, you must have realised or heard that 60-70% of your job would involve data preparation. So, it makes sense to have dedicated tools that creates reproducible workflows and also reduces the pain involved in these tasks. There has been varying amount of data transformation tools like Alteryx, Penthao, SSIS and Tableau prep, which to a large degree are one of the best in the market for BI types of task. However, these tools don't come cheap! There is a huge price tag to them. So, in the last few months here at White Box Analytics, we went on a journey to find an alternative to the costly ETL tools. We arrived at KNIME and we must say it is pretty good! So, we thought to share and highlight our experience with the open source tools.

KNIME is an open source data platform that has been around for more than 10 years now, just before the buzz of data science. It is a guided analytics software for data professionals and an easy to use interactive data platform that executes workflows in nodes. In the years it has been around, the platform has evolved from a standalone desktop software to a collaborative tool for data practitioners. It has also developed into other products to cater for enterprise users with the release of the KNIME Server and KNIME extensions. It can be described as a data platform as it covers most end-to-end workflow process of a data project. These include data transformation and import (ETL), visualisation and predictive modelling.

Specifically, our experience with the product has been with transforming and loading data into the data sources - popularly known as the Extract, Transform and Load (ETL) process - for our clients. With a fairly decent skill in data manipulation we found the software to be relatively quick to pick up and run workflows. It is very intuitive thanks to the development team as we find the panes to be really useful. The panes include KNIME Explorer pane, Node Repository pane, Description pane, Console pane, Outline pane, Workflow Coach pane and KNIME extension. See Figure 1.

Figure 1: A screenshot of the Knime desktop.

Figure 1: A screenshot of the Knime desktop.

As most guided analytics software and ETL tools, KNIME executes in nodes and the node repository pane groups these nodes into functions. These include a database group specifically for connecting and querying different databases. Out of the box, KNIME connects to the usual suspects databases; Microsoft SQL Server, MySQL, Oracle and PostgreSQL and can connect to any other data source using the arbitrary JDBC driver.

One other cool feature we found with the KNIME node repository is the file handling capabilities. Aside from being able to connect to Secure/File Transfer Protocol (SFTP/FTP) servers with the usual password authentication, we found connecting to the servers via public key to be seamless compared to other applications. Connecting to AWS S3 bucket was also a walk in the park as it was just a matter of installing the relevant extensions.

Something worth highlighting is the scripting capabilities of KNIME. This feature of KNIME emphasises why it is not just an ETL tool but a data platform. With this capability we found it easy to integrate other scripting languages into our workflow, which came in handy to execute some required tasks other than using the KNIME created nodes. So far, the languages we integrated with KNIME include R and Python, which are some of the most relevant languages in the industry, but it does have the ability to integrate other languages such as Java and JavaScript. This can be useful for those problematic csv files or if you would rather work in either R or Python but need reproducible visual workflows. Something to be very cautious about is the fact that some of its node terminologies could be a bit confusing to get your head around if you are using it for the first time. We found that we had some significant delays in completing a task just because we were confused about the input variable that goes into the source and target port for the “download/upload from list” node. This node connects to an FTP server and lists the files in a remote directory with the option to either download or upload to this directory, but the source and the target port expects a variable format name URI. It took quite a while to understand that the input variable which usually would be a file path had to be converted to a KNIME format called URI.

We did notice a cool community engagement. The KNIME community portal was really useful as it was well organised and it answered just about any question we had. This could be due to the maturity of the platform or using the right keywords but we just like to put it out there that it was fairly easy to find answers to most questions using the KNIME community portal. We found that the user of the platform are really engaged and this has led to the development of quite a number of KINME extensions. Currently, there is an extension for configuring slack messages because this is 2019!

There are two helpful features for beginners, these are Workflow Coach and the KNIME NodePit. The Workflow Coach pane is one of the features that makes KNIME stand out, particularly for beginners as it recommends a list of nodes with a percentage ranking that should be accompanied with a selected node. This feature just mimics the KNIME NodePit but in lesser details. The NodePit goes into greater detail, giving a detailed explanation about the node and also has a section called the Best Friends (Incoming and Outgoing) nodes that mostly accompanies a given node. The NodePit also has a section that give example workflows that incorporates the node in question into it. This is also super cool for beginner to re-engineer that example workflow to have a better understanding of the node.

Although there weren't a lot of difficulties working with KNIME, we found the desktop version wasn't built to handle very large datasets. While using KNIME desktop, the extent of the workload depended on the local server or computer's hardware capability as the temporary file are stored on the hard drive or memory. In our experience we also found that the desktop application struggle to handle a dataset of about 14 million records using a 16GB ram. Another downside faced with KNIME was its difficulty to successfully run stored procedures. Although there are workarounds for running stored procedures, we do hope that running them in later updates are as easy as just configuring a node.

This is by no means an exhaustive guide to using the KNIME desktop product, we believe there are multiple resources online that covers those topics. It is just a highlight of our experience with the software and an expression of why we feel it is a better alternative to the traditional SSIS ETL tool which seems not to have really evolved with the growth in the industry and a better alternative to the costly ETL tools like Alteryx and Informatica. There is something to be said about the open source movement and it’s importance to the data analytics industry, but we will leave that for a separate article. Until then, we hope you found this informative.


This review was written by Ije Iruemi, a data analyst for White Box Analytics.

For more data analysis and visualisations, click here.

Or, get in touch for a discussion about your data strategy.

Commentary, ReviewsGuest User