Many technology breakthroughs have taken place in the last 10-15 years and have brought us to the stage where automated data discovery (ADD) is achievable and will soon become the regular way of conducting data profiling and data exploration exercises. There was a time when pulling data from Teradata or Brio into Excel or Minitab and analyzing it took hundreds of hours, but not anymore.
The FAA made it mandatory for airlines to share delay data in early 2000. It took us a good number of days to do the data cleansing, data transformation, data manipulation, and presentation.
We were able to draw insights from the delays, but guess what: 90% of our time had gone into data preparation.
BI tools have eliminated most of these hurdles. It now takes a few hours to do it all and prepare the presentation charts.
We analyzed data coming from sensors on mining equipment to identify the optimal settings that give good throughput. Guess what! It took us considerable time to understand the data. We used R and Power BI. Data behavior changed over the months. Automated scripts in the BI tools enabled us to understand the data quickly. The question now facing the data science community is: is human involvement needed to detect a change in data behavior?
What is the first thing you do when you get a periodic dataset that you may or may not have seen before? You want to know how the data behaves. Is it similar to the behavior you knew before? You check descriptive statistics, missing values, outliers, and skewness. A histogram is good on some occasions and a box plot on others. Sometimes you segment the data and other times you look at the trendline. If there are missing values, you want to find a rule, or test an existing rule, to replace them with representative numbers. Is the number of outliers higher than the acceptable limit? Is an outlier an anomaly or a natural occurrence? If it is historical demand data, you want to see the trend and the pattern.
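The routine checks listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the method of any particular tool: the function name `profile_column`, the use of `None` to mark a missing value, the box-plot rule of 1.5 times the interquartile range for outliers, and the sample readings are all assumptions made for the example.

```python
import statistics


def profile_column(values):
    """Profile one numeric column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    spread = statistics.pstdev(present)
    # Quartiles feed the box-plot (1.5 * IQR) outlier rule.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Population skewness: the mean of the cubed standardized values.
    if spread == 0:
        skewness = 0.0
    else:
        skewness = sum(((v - mean) / spread) ** 3 for v in present) / len(present)
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": mean,
        "median": median,
        "std_dev": spread,
        "skewness": skewness,
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical readings with one gap and one suspicious spike.
readings = [10, 12, 11, None, 13, 12, 11, 95]
report = profile_column(readings)
print(report)
```

Whether the flagged value is an anomaly or a natural occurrence is still a judgment call; the sketch only surfaces it for review.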
Any data scientist would take these steps before presenting results. Why not automate them? But what if the data type stays the same while the data behavior changes? Can a tool identify the change in behavior and then suggest another analysis to conduct?
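One simple way a tool could flag such a change is to compare each new batch against a stored baseline. The sketch below is only an illustration under stated assumptions: the function name `behavior_changed`, the thresholds (3 standard errors for the mean, a factor of 2 for the spread), and the sample numbers are hypothetical, and a real tool would likely use more robust tests.

```python
import math
import statistics


def behavior_changed(baseline, current, z_limit=3.0, spread_limit=2.0):
    """Compare a new batch with a baseline on mean and spread."""
    base_mean = statistics.fmean(baseline)
    base_sd = statistics.stdev(baseline)
    cur_mean = statistics.fmean(current)
    cur_sd = statistics.stdev(current)
    if base_sd == 0:
        # A constant baseline: any difference counts as a change.
        return {"mean_shift": cur_mean != base_mean, "spread_change": cur_sd != 0}
    # How many standard errors the new mean sits from the baseline mean.
    z = (cur_mean - base_mean) / (base_sd / math.sqrt(len(current)))
    ratio = cur_sd / base_sd
    return {
        "mean_shift": abs(z) > z_limit,
        "spread_change": ratio > spread_limit or ratio < 1 / spread_limit,
    }


# Hypothetical throughput values: last quarter versus two new batches.
baseline = [50, 52, 49, 51, 50, 48, 51, 49]
print(behavior_changed(baseline, [51, 49, 50, 52, 50, 49]))  # similar batch
print(behavior_changed(baseline, [60, 62, 61, 59, 63, 60]))  # shifted batch
```

When either flag is raised, the tool could rerun the profiling step or propose a different analysis instead of silently reusing old conclusions.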
Automated data discovery helps you do that. It lets you go one step further than a BI tool or R-driven scripts. An automated data discovery tool conducts automated data profiling as well as typical BI reporting. This gives you constant, up-to-date knowledge of your data and its qualities. An ADD-driven tool will find the trend in every quantitative field, find correlations between any combination of quantitative fields automatically, and report the results.
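The trend-and-correlation sweep described above can be sketched as follows. This is a rough illustration, not how any specific ADD product works: the helper names `trend_slope`, `pearson`, and `discover`, the column names, and the values are all hypothetical. The trend is taken as the least-squares slope of each column against its row position, and the correlation is the Pearson coefficient for every pair of columns.

```python
import math
from itertools import combinations


def trend_slope(ys):
    """Least-squares slope of the values against their row position."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def discover(table):
    """Compute a trend per column and a correlation per column pair."""
    trends = {name: trend_slope(col) for name, col in table.items()}
    correlations = {
        (a, b): pearson(table[a], table[b]) for a, b in combinations(table, 2)
    }
    return trends, correlations


# Hypothetical equipment data: equal-length numeric columns, no gaps.
table = {
    "load": [1, 2, 3, 4, 5],
    "throughput": [2, 4, 6, 8, 10],
    "temperature": [5, 3, 4, 3, 5],
}
trends, correlations = discover(table)
print(trends)
print(correlations)
```

The sketch assumes every column has at least two rows and no missing values, so profiling and cleansing would need to run first.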
Every new dataset coming from upstream has some assumptions built into it. It is very important to profile the data before feeding it to models. I think it is important to invest in automated data discovery before investing in machine learning. Don't draw conclusions from insights you observed six months ago. We need to check data behavior frequently.
Ideas for automated data discovery come naturally when one has used a good BI tool for a while. To be honest, you get addicted to finding insights or patterns using Tableau, Power BI or other tools. Many BI tools have enabled connections to R. Microsoft SQL Server has also brought R into its offering. In other words, algorithms and data preparation have come closer together and are accessible on one platform. This new offering requires the mindset of a developer and a data scientist. End users want to generate insights without worrying about SQL Server or R code. This is where ADD steps in.
Automated data discovery tools do and will have the power to discover: to cleanse, transform, run algorithms, generate insights, and use the right chart type in an automated or semi-automated fashion. Automated data discovery tools will also become daily-use products in the coming years. BI tools will perhaps transform themselves into automated discovery tools.
The ease of doing business intelligence is good as well as bad. People tend to pick the chart that lets them tell a story. The chances of a wrong inference are high when you are limited to an isolated insight. It is important to check the data profile and apply a justified technique. Automated data discovery will bring data science into the hands of people who don't have time to sit and analyze repeatedly. You will still choose the chart that tells the story right, but the chances of being misled by a good visual will be reduced significantly.
Vinay is a data scientist with wide-ranging experience in transportation, industrial engineering and B2B marketing. He works with teams to identify the inherent need of the business and how data analytics can help, particularly in maintenance (predictive and preventive).