Data analysts are sometimes challenged to quickly understand and discover valuable insights from new data sets or data sources. This can be an overwhelming task, especially if there’s no metadata documentation available and/or no subject-matter experts available to help. When business leaders and executives are looking for fast answers to their questions, you need to be ready to respond with high quality BI and analytics. Here are some best practices to quickly familiarize yourself with uncharted data sources in SQL databases.
Explore the Schema
You’ll often need to begin by selecting sample data from many different tables. This will help you get an initial idea of what information is available and how/where it’s stored in the database. Make note of the tables that seem to hold data critical to the analysis you want to perform. Run COUNT and COUNT DISTINCT functions on all critical tables to discover which column(s) define a unique row in that table. For these critical tables, identify which columns seem most pertinent to your analysis and which columns might link up with other tables of interest. Innovative business insights are usually derived from data in many tables across the database, so table joins are important to determine at this stage.
Run Simple Statistics on Critical Tables
When you’ve identified the tables containing data critical to your analysis, you should explore those tables further with some high-level statistics. With sales data, for example, you might want to look at daily/monthly/yearly totals for sales revenue, number of customers, etc. You might look at the breakdowns by customer type, market region, product SKU, etc. Use the MIN and MAX functions to see the range of values in critical numeric or date columns. This will help you identify data that seems out of bounds and should potentially be excluded from your analysis (e.g. dates in the future, default dates like 1/1/1900, negative or blank revenue values).
Check Findings with Accepted Business Knowledge
Once you have an idea of where the critical data is stored and some high-level summary statistics about that data, it’s a good time to compare your findings with accepted business knowledge. Do the statistics you’re pulling align with published reports within the organization? If not, can you speak with the owners of those reports to understand the discrepancies? Is there a reporting application available that can help you check the summary statistics or even validate column definitions/values on individual records?
Document What You’ve Learned
It’s always a good idea to take notes as you go through this data exploration. As you complete the initial exploration, you may want to add structure and details to your notes so they can serve as good reference documents for you in the future. More than likely, you’ll end up sharing your documentation with others in your company who need to access and utilize the same data. Standard reference documentation is an important part of data management and governance that can help ensure everyone is interpreting and using the data in a consistent way.
Start your Analysis
Now that you have a better understanding of the available data, you are ready to begin your actual analysis! It’s important to continue to tread lightly at this point since you aren’t truly an expert on this data yet. When you believe you’ve discovered important insights, it’s prudent to run your findings by a knowledgeable peer for a sanity check before sharing more broadly. If the peer uncovers any incorrect logic or assumptions in your analysis, make sure you update your metadata documentation appropriately.