Throughout my career as a Data Engineer I have noticed that most companies want a data lake. When asked about the objectives for a data lake, the people in charge of making it happen say they want all of the company's data available in one place, accessible to everyone, while eliminating data silos. When I ask employees the same question, what they think about a data lake project and what its objectives should be, the responses vary, but the most common are: "it's something cool to work on", "we don't need it", and "we have been asking for something like this". The last group tends to be reporting analysts and data scientists who have a difficult time getting their jobs done due to the limited amount of data they have access to.

By now you have noticed the difference in what stakeholders want from a data lake. Management wants efficiency and cost reduction, while employees want fast, easy-to-use tools that can plug into existing processes. Both sides tend to agree on the overall strategy of efficiency and eliminating silos, but since there is barely any shared understanding of what a data lake is or what services it should offer, the individual contributors working on the project end up re-creating the current processes with new tools to hit management's goals. In other words, a lift and shift with few or no code improvements, and few or no new ways to process data or deal with current problems. Data is being sent to the data lake, and that makes upper management happy that all silos have been magically eliminated. Then, after the company declares victory that it created a data lake, it notices that barely anyone uses it, and that all of the processes that were supposed to be using the data lake still run off the old data warehouses, killing the project because no one cares anymore to fix it.
In this blog post I am going to present a data lake solution architecture that actually works. This infrastructure can get you the data you want or need (if you have access to that data to begin with) and will give you some flexibility in the tools you choose for your use case.
The One Size Fits Most Data Lake
A data lake has three components at its most basic level:
- Data Consumer/Producer
- Data Storage
- Data Querying tool
The data consumer/producer could be an SFTP tool that pushes data to storage from a data warehouse or from other servers that serve files. You can also set up an Apache Kafka cluster that connects to multiple data sources and copies over entire tables, files, or individual updates, inserts, and deletes. The second component is the data storage, which could be Amazon Web Services S3. The third component is the one most projects forget to address: the data querying tool. There are several options here, such as Presto, Snowflake, Redshift, and Athena.

With these three components you are 80% done with your project, and depending on your data sources you might be able to create a useful data lake in less than 8 hours of work, or it might take months. I recommend that the data lake service creation and architecture design not exceed six months: plan for roughly four months of actual work, plus a buffer, because some companies are really slow to approve changes and that slowness is just red tape. Also, if you take too much time ideating the data lake, other problems in the company might take the spotlight, and some people will say that if the data lake were working, none of this would be happening, making you and the team responsible feel the pressure to deliver without clean milestones and with additional goals.
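To make the three-component flow concrete, here is a minimal sketch in Python. It is an illustration, not a prescription: the local filesystem stands in for S3, a plain function stands in for a query engine like Presto or Athena, and the `events` table name, the `dt=` partition layout, and the sample rows are all assumptions made up for this example.

```python
import csv
from pathlib import Path

# --- Component 1: data producer.
# Stands in for an SFTP push or a Kafka connector landing batches in storage.
def produce(lake_root: Path, table: str, partition: str, rows: list[dict]) -> Path:
    """Write a batch of rows as CSV under an S3-like partitioned key layout."""
    target = lake_root / table / f"dt={partition}"
    target.mkdir(parents=True, exist_ok=True)
    path = target / "part-0000.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path

# --- Component 3: data querying tool.
# Stands in for Presto/Athena scanning the storage layer directly.
def query(lake_root: Path, table: str, predicate) -> list[dict]:
    """Scan every partition of a table and keep rows matching the predicate."""
    results: list[dict] = []
    for part in sorted((lake_root / table).glob("dt=*/part-*.csv")):
        with part.open(newline="") as f:
            results.extend(row for row in csv.DictReader(f) if predicate(row))
    return results

# --- Component 2: data storage. A directory standing in for an S3 bucket.
lake = Path("./lake")
produce(lake, "events", "2023-01-01", [
    {"user": "a", "action": "login"},
    {"user": "b", "action": "purchase"},
])
produce(lake, "events", "2023-01-02", [
    {"user": "a", "action": "purchase"},
])

purchases = query(lake, "events", lambda row: row["action"] == "purchase")
print(len(purchases))  # 2 purchase events found across both partitions
```

The point of the sketch is the shape of the architecture: producers only write to storage, and the querying tool only reads from it, so either side can be swapped (SFTP for Kafka, Athena for Presto) without touching the other.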
Expand on tools for the 3 components
Expand on optional components and why:
- data catalog
- security and access
- data storage formats
- data analytics