Monday, March 6, 2017

Applying data science for effective strategic planning: Designing and building a data warehouse


Does your data warehouse belong in the cloud?

Are you interested in leveraging proprietary data that your organization is already collecting, or in doing the same with readily available public data? Data science and business analytics should not be viewed merely as the latest buzzwords, but rather as a combination of formal disciplines in the arts and sciences, some of them quite traditional.

This article describes, at a high level, my thought process behind the choice of a storage platform, focusing on the Google Cloud Platform. It draws on the volunteer work I performed for JerseySTEM, a non-profit organization, where data from the State of New Jersey Board of Education had to be analyzed, primarily to identify under-served communities and to allocate resources efficiently for Science, Technology, Engineering and Mathematics (STEM) educational initiatives.

This is not a technical "how to" document; there are already many well-written tutorials available on the internet, so there is no need to reinvent the wheel. However, documentation links have been included wherever they formed part of my decision-making process.

The big picture


Although this article focuses on the planning and design of the data repository within a data analysis project, it is worth noting where this fits within the SAS Analytical Life Cycle (taken from the white paper Data Mining From A to Z).

SAS Analytical Life Cycle

As shown above, we are merely concerned with data preparation at this point. However, there will be considerable impact on downstream processes, and significant productivity lost to rework, if this early phase is not well planned and executed.

Considering cloud hosted services


Unless you were pulled into a project at the ground level, a data repository would usually already be available. In the JerseySTEM use case, however, although the external source data was well established, there was no working data repository in place that could support the required research and analysis. As such, the database had to be built from scratch.

As this was a non-profit project, it was extremely important to be cost conscious. Even so, I did not want to compromise on the prerequisites of a robust, easily supportable and scalable model, which outsourced cloud platforms provide via Software as a Service (SaaS), Platform as a Service (PaaS) or Infrastructure as a Service (IaaS).

Hence, my first order of business was to consult industry subject matter experts who had experimented with cloud technologies as end users. The general consensus among them was that cloud storage pricing is cheap to start with, but can escalate if there is a future need for high-bandwidth extraction of data, because of egress charges; an analogy would be the cost structure of back-end loaded mutual funds.

That said, the biggest attraction of a cloud based data storage solution, to me personally, was that it was lightweight in terms of organizational support, something extremely valuable to a non-profit where volunteers come and go.

Bearing in mind that every use case is different, we could control operating costs by:
  • Limiting egress activity, in our case, to our internal team of data analysts and scientists. Public data could be published to other, more cost-efficient platforms for general consumption without turning on public access to the database.
  • Turning the database on and off (like the light switch in a room) whenever our data analysts needed to access the data. This was workable given the separate layer for public access described in the previous point, and given that the externally sourced data is only updated annually; a sketch of this on/off switch follows this list.
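As one possible implementation of that on/off switch, the sketch below calls the gcloud command line tool from Python; the activation policy setting is Google's documented mechanism for stopping and starting a Cloud SQL instance. The instance name is a placeholder, and gcloud must be installed and authenticated for this to run.

```python
# Sketch of the on/off "light switch": Cloud SQL's activation policy is
# the documented way to stop (NEVER) and start (ALWAYS) an instance.
# The instance name below is hypothetical.
import subprocess

INSTANCE = "jerseystem-edu-db"  # placeholder instance name

def set_database_power(on):
    """Start or stop the Cloud SQL instance via the gcloud CLI."""
    policy = "ALWAYS" if on else "NEVER"
    subprocess.run(
        ["gcloud", "sql", "instances", "patch", INSTANCE,
         "--activation-policy", policy],
        check=True,
    )

set_database_power(True)   # before an analysis session
set_database_power(False)  # afterwards, so compute charges stop accruing
```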

Exploring different cloud storage types


Given our very specific requirements, I decided on a cloud based solution from Google, since we were already using their email services, which made it less cumbersome administratively. Furthermore, a significant number of addresses from the school directory had to be converted to GPS coordinates during the initial upload of school reference data into the database. Google was offering $300 in credits over a limited trial period, and the ability to reduce the manual labor of GPS lookups using the Google Maps Geocoding API, albeit for a limited time, sealed the deal.
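As an illustration of that geocoding step (not the exact script used for the project), a lookup against the Google Maps Geocoding API web service could look like the following sketch; the API key and the sample address are placeholders.

```python
# Sketch only: geocode one school address via the Google Maps Geocoding API.
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "YOUR_API_KEY"  # placeholder; issued through the Google Cloud Console

def geocode(address):
    """Return (latitude, longitude) for an address, or None if no match."""
    resp = requests.get(GEOCODE_URL, params={"address": address, "key": API_KEY})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Hypothetical directory entry, not a real record from the dataset:
print(geocode("100 Main Street, Newark, NJ"))
```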

At the end of the day, given that the data in question was highly structured and not within the realm of big data (size wise), I chose a traditional MySQL Relational Database Management System (RDBMS) hosted on Google Cloud SQL. Purely from a data storage perspective, it made no sense to fit a square peg in a round hole by forcing the data into a NoSQL solution such as Google Cloud Bigtable or Google Cloud Datastore. I did spend some time considering and experimenting with NoSQL options because of their more affordable storage pricing model; that time was still well spent, since they could be deployed more efficiently and cost effectively should the need arise for more complex data analyses in the future.
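For a sense of what day-to-day access might look like, here is a minimal sketch of querying the Cloud SQL hosted MySQL database from an analyst's machine. The host, credentials, database and table names are all placeholders, and PyMySQL is just one of several MySQL drivers that would work.

```python
# Sketch: connect to the hosted MySQL database and run a simple query.
import pymysql

conn = pymysql.connect(
    host="203.0.113.10",      # placeholder: the Cloud SQL instance address
    user="analyst",           # placeholder credentials
    password="...",
    database="nj_education",  # hypothetical schema name
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM school")  # hypothetical table
        print(cur.fetchone()[0])
finally:
    conn.close()
```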

Data repository design


I spent years working as a front office desk developer in financial services. Those familiar with the industry will know that this entails direct interaction with traders and salespeople, with high impact deliverables and time critical turnarounds. Coming from this background, I realized early in my career the importance of designing and building easily supportable and highly configurable applications. Applying similar principles to database schema design meant keeping the schema as flexible as possible, with the ability to scale not just in size but also in functionality.
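To make that flexibility concrete, here is a hypothetical sketch of such a layout: instead of one wide table with a column per statistic, metrics live in rows keyed by school, academic year and metric name, so a new year or a new statistic is just more rows, not a schema change. All table and column names here are invented for illustration.

```python
# Hypothetical DDL illustrating the flexible, row-per-metric layout.
SCHEMA_DDL = """
CREATE TABLE school (
    school_id INT PRIMARY KEY AUTO_INCREMENT,
    name      VARCHAR(255) NOT NULL,
    district  VARCHAR(255),
    latitude  DECIMAL(9, 6),   -- populated via the Geocoding API
    longitude DECIMAL(9, 6)
);

CREATE TABLE school_metric (
    school_id     INT NOT NULL,
    academic_year CHAR(9) NOT NULL,      -- e.g. '2015-2016'
    metric_name   VARCHAR(64) NOT NULL,  -- e.g. 'enrollment_grade_8'
    metric_value  DECIMAL(12, 2),
    PRIMARY KEY (school_id, academic_year, metric_name),
    FOREIGN KEY (school_id) REFERENCES school (school_id)
);
"""
```

The trade-off of this style is that analysis queries must pivot values back into columns, but for a modest, annually refreshed dataset that cost is small compared with the freedom to add metrics without schema migrations.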

Extract, Transform and Load

Before we could proceed with any form of analysis, the data had to be extracted, transformed and loaded into a database; this is commonly known within the data industry as the Extract, Transform and Load (ETL) process.


ETL process

The majority of the education data sat in Excel spreadsheets and Comma Separated Value (CSV) files, mostly as some form of pivot table segregated by worksheet or file per academic year. All of it had to be transformed into a format suitable for uploading into normalized RDBMS tables, where data quality could be validated and the data finally retrieved through Structured Query Language (SQL) queries formulated for analysis in the decision-making process.
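As a hedged sketch of that transformation step, the snippet below uses pandas (an assumption; any similar tool would do) to unpivot a made-up wide-format extract, one row per school with a column per grade level, into the one-row-per-observation shape a normalized table expects.

```python
# Sketch of the unpivoting step; the wide-format rows below are invented
# to mirror a typical one-row-per-school spreadsheet.
import pandas as pd

wide = pd.DataFrame({
    "school_name": ["Lincoln MS", "Washington MS"],
    "grade_6": [110, 95],
    "grade_7": [104, 90],
    "grade_8": [98, 87],
})

# One row per (school, grade): ready for loading into a normalized table.
long = wide.melt(id_vars=["school_name"], var_name="grade",
                 value_name="enrollment")
long["academic_year"] = "2015-2016"
print(long)
```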

Data quality is paramount

Data Management Association (DAMA) Data Management Body of Knowledge (DMBOK)

As most data professionals will know, the ETL process is only a subset of the 11 data management knowledge areas (see the Guide Knowledge Area Wheel) stipulated in the Data Management Association (DAMA) Data Management Body of Knowledge (DMBOK). That topic is beyond the scope of this article; interested readers can refer to the DMBOK for a deeper understanding of data management best practices. For now, suffice it to say that data quality is paramount, and reasonable effort should be dedicated to it in any data science and analytics project to avoid the dreadful situation of "Garbage In, Garbage Out"!
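In that spirit, a few cheap pre-load checks go a long way. The sketch below, reusing the hypothetical column layout from the earlier ETL example, rejects obviously bad rows before they ever reach the database.

```python
# Sketch of minimal pre-load quality checks.
import pandas as pd

def validate(df):
    """Raise ValueError on basic quality problems; return df if clean."""
    if df["school_name"].isna().any():
        raise ValueError("rows with a missing school name")
    if df.duplicated(subset=["school_name", "grade"]).any():
        raise ValueError("duplicate (school, grade) rows")
    if (df["enrollment"].dropna() < 0).any():
        raise ValueError("negative enrollment figures")
    return df

clean = validate(pd.DataFrame({
    "school_name": ["Lincoln MS"],
    "grade": ["grade_6"],
    "enrollment": [110],
}))
```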

Conclusion


In summary, anybody planning to build a data warehouse from scratch should consider all options, from building and supporting it in house to outsourcing to a cloud service provider. Be aware of the different pricing packages available and the costs (in both time and money) associated with each option, and always factor those into your final decision.

This post was first published on LinkedIn.

