Image Courtesy : O'Reilly
Source / Scrape : Today's data source could be anything: wiki pages, a Splunk query, a server log, or any device that produces data in some format (.csv, .log, .json, .xml, .xls, etc.). You need to be skilled in a bunch of tools like Python / Pandas, shell scripts, jq, CSV utilities, sed, and so on.
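As a minimal sketch of that first step, here is what pulling a couple of common formats into Pandas can look like (the file names and columns are placeholders, not from any real project):

```python
import pandas as pd

# Minimal sketch: file names are hypothetical.
# Read a JSON-lines export (one JSON object per line), e.g. from a log shipper.
events = pd.read_json("events.jsonl", lines=True)

# Or a plain CSV dump from some device or report.
report = pd.read_csv("device_report.csv")

print(events.head())
print(report.dtypes)
```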
Data Cleansing : Pandas is a good option, but you are not limited to it. The Community Edition of Pentaho PDI can help with data cleansing, and SSDT (formerly SSIS) has transformations that help as well. Knowledge of regular expressions will be of great help.
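A short Pandas-plus-regex cleansing sketch, using an invented frame just to show the pattern (trimming text, normalizing case, extracting numbers out of messy strings):

```python
import pandas as pd

# Hypothetical raw data; in practice this comes from the scrape step above.
raw = pd.DataFrame({
    "host": [" web-01 ", "WEB-02", None],
    "latency_ms": ["12ms", "7 ms", "n/a"],
})

clean = (
    raw
    .dropna(subset=["host"])                               # drop rows with no host
    .assign(
        host=lambda d: d["host"].str.strip().str.lower(),  # normalize whitespace/case
        # use a regex to pull the numeric part out of strings like "12ms"
        latency_ms=lambda d: pd.to_numeric(
            d["latency_ms"].str.extract(r"(\d+)", expand=False), errors="coerce"
        ),
    )
)
print(clean)
```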
Database : Knowledge of databases is key. Any RDBMS (Microsoft SQL Server, Postgres, MySQL, Oracle) will do. I personally like MySQL 8, as it has NoSQL (document store) and API capabilities.
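To connect the cleansing step to the database step, here is a hedged sketch of loading a DataFrame into MySQL through SQLAlchemy (the connection string, database, and table name are placeholders, and it assumes a driver such as PyMySQL is installed):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust for your own MySQL
# (or Postgres / SQL Server / Oracle) instance.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/metrics")

clean = pd.DataFrame({"host": ["web-01"], "latency_ms": [12]})

# Push the cleansed frame into a table so later steps can query it with SQL.
clean.to_sql("latency_readings", engine, if_exists="append", index=False)

# Read it back to confirm the round trip.
print(pd.read_sql("SELECT host, latency_ms FROM latency_readings", engine))
```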
Explore : Jupyter / Anaconda / IPython + Pandas + Matplotlib is a good combination. Spark with Zeppelin will also work, but setting up the cluster is too much effort.
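A typical exploration cell inside a notebook might be no more than a describe() and a quick plot; the frame below is made up purely for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data; in practice this would be a query against the table above.
df = pd.DataFrame({
    "host": ["web-01", "web-01", "web-02", "web-02"],
    "latency_ms": [12, 18, 7, 9],
})

# Quick look at the shape of the data before deciding what to build.
print(df.describe())

# A one-liner plot is usually enough at the exploration stage.
df.groupby("host")["latency_ms"].mean().plot(kind="bar", title="Mean latency per host")
plt.ylabel("latency (ms)")
plt.tight_layout()
plt.show()
```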
Deliver : A GraphQL or REST API is one option; accessing the database directly (using an ORM tool such as SQLAlchemy) or with plain SQL will be the time-saving option.
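As one possible delivery sketch (Flask is my choice here for brevity, not something the workflow prescribes; the connection string and table are the same placeholders as above), a single endpoint can expose the table as JSON:

```python
from flask import Flask, jsonify
import pandas as pd
from sqlalchemy import create_engine

app = Flask(__name__)

# Placeholder connection string; reuse whatever RDBMS holds the cleansed data.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/metrics")

@app.route("/latency")
def latency():
    # Plain SQL through the engine keeps this endpoint short and transparent.
    df = pd.read_sql("SELECT host, latency_ms FROM latency_readings", engine)
    return jsonify(df.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(port=5000)
```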
Transform : Knowledge of powerful open-source JS libraries such as D3.js and Chart.js will be helpful. Pentaho Community Edition is a good alternative. If you have access to Tableau (the paid version), that is good too.
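The charting itself happens in the browser, but the data usually arrives from the delivery layer; here is a small sketch of shaping a query result into the labels/datasets structure Chart.js consumes (the field names are assumptions):

```python
import pandas as pd

# Hypothetical aggregated result, e.g. from the /latency endpoint above.
df = pd.DataFrame({"host": ["web-01", "web-02"], "latency_ms": [15.0, 8.0]})

# Chart.js takes a simple {labels, datasets} structure; building it server-side
# keeps the D3.js / Chart.js front end a thin rendering layer.
chart_payload = {
    "labels": df["host"].tolist(),
    "datasets": [
        {"label": "Mean latency (ms)", "data": df["latency_ms"].tolist()}
    ],
}
print(chart_payload)
```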
Knowledge of Docker will help you with one or more of the components given above.