I am new to data science and am playing the role of data architect/engineer on our team. Given modern data science advancements and projects, what tasks should I work on to enable the data science team to do their work? We have a barebones data science platform with Apache Spark and Zeppelin, integrated with Amazon S3 as the data store. The projects the data scientists plan to work on include predictive analytics, forecasting, content analysis, and any new NLP projects that we come up with in the near future.
Based on this, what do you think I, as a data architect/engineer, should work on from the platform/engineering side to enable them to work on current and future data science projects?
For example, I have some specific questions as a starting point:
- What platform support is required for reading from multiple data sources, including files, social media, and APIs?
- Do we need to think about building a data lake or a data warehouse?
- What is the platform support required by data scientists for working on NLP projects?
- What data visualization tool support do data scientists need for modern data science projects?
This list may not be complete. I would appreciate your expert input on building data science platforms.