I’m often asked about the skills required to have a career working with data. As a data analyst, you may be responsible for data mining, analyzing existing data assets, and presenting insights based on the needs of your company. As a data scientist, you may also be responsible for identifying methods for acquiring high volume/velocity data, creating machine learning models, and making predictions using mathematical & scientific methods. If you’re interested in careers that solve problems with data, here are some fundamental skills to develop:
Fundamental Skills To Develop
- Python – A general-purpose, object-oriented programming language that is extensible through a wide range of libraries.
- 66% of data scientists are using Python daily and 84% of them use it as their main language
- Top Python libraries: TensorFlow (ML solutions), Keras (deep learning), scikit-learn (ML), NumPy (data analysis/ML), PyTorch (deep learning)
- R – An open-source programming language used for statistical computing, with robust visualization libraries (ggplot2, plotly)
- 47% of data scientists are using R.
- It is used by 70% of data miners
- Good For: statistical analysis and modeling, analyzing structured and unstructured data
- Structured Query Language (SQL) – The standard language for querying and managing data in relational databases; combines analytics with transactional capabilities.
- 32% of data scientists use SQL
- Good for: data management, transactional capabilities.
- Data Visualization – for exploration, storytelling and communicating quick insights.
- Tools: ggplot2 for R, Matplotlib for Python, Tableau, Power BI, SAC
- Statistics – To support a general understanding of probability, distributions, sampling, hypothesis testing, confidence intervals, and variables.
- Spreadsheets – The all-purpose tool for reviewing data and running quick calculations.
- Tools: Excel, Google Sheets
- Algebra & Calculus – To support a general understanding of how algorithms work under the hood.
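To make a few of these skills concrete, here is a minimal sketch that combines three of them: SQL for aggregation, Python as the glue, and basic statistics for a confidence interval. It uses only Python's standard library (`sqlite3`, `statistics`, `math`). The `orders` table, its columns, and the values in it are made up for illustration, and the interval uses a simple normal approximation, which is an assumption that is rough for a sample this small.

```python
import math
import sqlite3
import statistics

# Made-up order values for illustration only; not real data.
ORDERS = [
    ("north", 120.0), ("north", 135.5), ("north", 128.0),
    ("south", 98.0), ("south", 110.5), ("south", 101.5),
]


def load_orders(rows):
    """Create an in-memory SQLite table and insert the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn


def average_by_region(conn):
    """Let SQL do the grouping and averaging, then return a dict."""
    query = (
        "SELECT region, AVG(amount) FROM orders "
        "GROUP BY region ORDER BY region"
    )
    return dict(conn.execute(query).fetchall())


def mean_confidence_interval(values, z=1.96):
    """Approximate 95% interval for the mean (normal approximation).

    For small samples a t-distribution would be the more careful choice.
    """
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * std_err, mean + z * std_err


conn = load_orders(ORDERS)
averages = average_by_region(conn)
amounts = [row[0] for row in conn.execute("SELECT amount FROM orders")]
low, high = mean_confidence_interval(amounts)

print(averages)
print(f"approximate 95% interval for the mean: {low:.1f} to {high:.1f}")
```

The split is deliberate: the database handles filtering and grouping, while Python handles the statistics that SQL does not express as naturally.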
Your Development Space: IDEs & Dev Tools
Tools for writing software.
- Jupyter notebooks – Provides an interactive programming interface in a notebook environment. Good for: rapid prototyping, visualization.
- PyCharm – Python IDE that supports single-file projects as well as multi-file and multi-language projects. Good for: writing production code.
- VS Code for Python – Microsoft's Visual Studio Code editor, with Python support added through the Python extension.
- GitHub – Hosting platform for Git repositories. Good for: version control, tracking and recording code changes.
- RStudio – IDE for R.
Safe Spaces to Practice: Communities & Open Datasets
Finding open communities & data sources for practice projects.
- Google Dataset Search – over 25 million datasets indexed
- Data.gov – Open data lake provided by the U.S. Government
- Kaggle – Real-world datasets provided to the Kaggle community for collaborative problem solving
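Once you have downloaded a dataset from one of these sources, a useful first step is a quick profile: how many rows, which columns, and where values are missing. The sketch below shows one way to do that with Python's standard `csv` module. The `SAMPLE_CSV` text and the `profile_csv` helper are hypothetical stand-ins for illustration; they are not taken from any of the sources above.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded CSV file; the rows are made up for illustration.
SAMPLE_CSV = """city,year,population
Springfield,2020,116000
Springfield,2021,
Riverton,2020,10900
Riverton,2021,11050
"""


def profile_csv(text):
    """Return row count, column names, and missing-value counts per column."""
    reader = csv.DictReader(io.StringIO(text))
    missing = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        for column, value in row.items():
            # Treat empty or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[column] += 1
    return row_count, reader.fieldnames, dict(missing)


rows, columns, missing = profile_csv(SAMPLE_CSV)
print(rows, columns, missing)
```

With a real file, you would pass the file's contents instead of `SAMPLE_CSV`, or adapt the helper to read from `open(path, newline="")`.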
Thinking About Solving Problems With Data
Here are more resources and considerations for people who want to solve problems with data.
- Careers – Identifying your North Star (Innovative Career Strategies)
- Building Teams – Building your team (AI & New Rules for Solving the World’s Biggest Problems)
- Discovery – Discovering opportunities to make small changes that lead to big results (Identifying the Area of Highest Leverage)
- Data – Evaluating your data assets (Data Iceberg Model)
- Analysis – Navigating assumptions (Understanding the Impact of Assumptions)
- Technologies – Knowing when to use what (Characteristics of Machine Learning applications)
- DevOps – Integrating agile development cycles with machine learning workflows (Agile Machine Learning)