Renee Betina Esperas

Top Skills You Need in Testing Big Data projects

🙋‍♀️ Hey! Are you also a QA/Tester/Automation Engineer wondering how to test Big Data/AI-related projects? If yes, this is for you.

What is Big Data? What is AI? Most of us know these concepts, but what is it like to actually work on them? Are they really testable?

Beyond the testing fundamentals, Big Data projects will push you to skill up. Here is the combination of skills that helped me the most on my previous big data project.

1. SQL/NoSQL

SELECT * FROM ____ WHERE <column_name> = _____;
This is basic, but you need to level up a bit. When working on a very large data set, the keyword LIMIT will come in handy. Why LIMIT? Big data = more rows than you can count. LIMIT caps the number of rows a query returns, so you can sample a table without pulling everything back.
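Here is a minimal sketch of LIMIT in action. It uses Python's built-in sqlite3 module as a small stand-in for a real warehouse; the `events` table and its columns are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a much larger warehouse table.
# The table name and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, "ok" if i % 2 == 0 else "failed") for i in range(1000)],
)

# 500 rows match the filter, but LIMIT 5 returns only five of them.
sample = conn.execute(
    "SELECT * FROM events WHERE status = ? LIMIT 5", ("failed",)
).fetchall()

print(len(sample))
```

On a real big data table the gain is much larger: you get a handful of rows to inspect instead of waiting for (or paying for) the full result set.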

Another concept you might want to know about is partitioning.
What are partitions? These are smaller parts of a huge table. With partitions, queries that filter on the partition column run faster. Partitions also help in 'organizing' your data. A common partition criterion is the 'date' (but it can be anything, depending on your DB or table structure and project implementation). So imagine a table with billions of records. Instead of traversing the entire table, the query only scans the specific partition you asked for.
SELECT * FROM _____
WHERE <column_name> = _____
AND <partition_column_name> = ____
LIMIT ____ ;
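To see why filtering on the partition column helps, here is a rough pure-Python simulation. It is not how any particular engine implements partitions; the rows, dates, and function names are all hypothetical, and the point is only to compare how many rows each approach has to look at.

```python
from collections import defaultdict

# Hypothetical rows: (event_date, user_id). The date acts as the partition column.
rows = [
    (f"2024-01-{day:02d}", user_id)
    for day in range(1, 4)
    for user_id in range(1000)
]

# "Partitioned" layout: rows grouped by date, like one folder per date.
partitions = defaultdict(list)
for row in rows:
    partitions[row[0]].append(row)


def full_scan(target_date, user_id):
    """Look at every row in the table."""
    scanned, matches = 0, []
    for row in rows:
        scanned += 1
        if row == (target_date, user_id):
            matches.append(row)
    return matches, scanned


def partition_scan(target_date, user_id):
    """Look only at rows inside the requested partition."""
    scanned, matches = 0, []
    for row in partitions.get(target_date, []):
        scanned += 1
        if row[1] == user_id:
            matches.append(row)
    return matches, scanned


full_matches, full_scanned = full_scan("2024-01-02", 42)
part_matches, part_scanned = partition_scan("2024-01-02", 42)
print(full_scanned, part_scanned)
```

Both functions find the same row, but the partitioned lookup touches one third of the data here. With billions of rows and hundreds of date partitions, that difference is what keeps your test queries from timing out.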

Bonus: SQL Cheatsheet

Credits to the original owner (btw, I added the DISTINCT keyword here!)

Bonus: Common Big Data Queries to check table definition and partitions
DESCRIBE <table_name>;
SHOW CREATE TABLE <table_name>;
SHOW PARTITIONS <table_name>;

2. Python / R

As a QA, you don't need to master it, but knowing Python/R (or whatever programming language your company uses) will definitely help you understand the developers, especially when they say that a certain fix is "complex".

Being confident in programming with Python is a real plus, as you can also use it to automate your tests. There are lots of Python libraries that you can import for your UI/API/data testing.
(I'll create separate articles for Test Automation soon! 🙂👍 )
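As a small taste of data testing in Python, here is a sketch that uses only the standard library. The function name, the sample rows, and the two checks (missing values and duplicate keys) are illustrative assumptions, not a fixed recipe.

```python
def find_data_issues(rows, key, required):
    """Return a list of human-readable issues found in a list of dict rows."""
    issues = []
    seen = set()
    for index, row in enumerate(rows):
        # Check 1: required columns must not be missing or empty.
        for column in required:
            if row.get(column) in (None, ""):
                issues.append(f"row {index}: missing {column}")
        # Check 2: the key column must be unique across rows.
        if row.get(key) in seen:
            issues.append(f"row {index}: duplicate {key}={row.get(key)}")
        seen.add(row.get(key))
    return issues


# Hypothetical sample data with one empty email and one repeated id.
sample_rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "c@example.com"},
]

issues = find_data_issues(sample_rows, key="id", required=["id", "email"])
for issue in issues:
    print(issue)
```

In a real project you would likely run checks like these against query results and wire them into your test runner, but the idea stays the same: state what "good data" means, then assert it.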

3. Background knowledge on Data Science / Statistics

What is clustering? What is k-means? What is Depth First Search (DFS) algorithm?
If you are working on big data projects, developers will use lots of terms that you don't usually hear on other projects.
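To make one of those terms concrete, here is a minimal sketch of k-means on one-dimensional data. It assumes the starting centroids are handed in (real libraries usually pick them for you), and the function name and sample points are made up for illustration.

```python
def kmeans_1d(points, centroids, max_iterations=10):
    """Group numbers into clusters around the nearest centroid."""
    for _ in range(max_iterations):
        # Step 1: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(point - centroids[i]),
            )
            clusters[nearest].append(point)
        # Step 2: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1, 12])
print(centroids)
print(clusters)
```

Knowing that the algorithm is just "assign, then re-average, then repeat" makes it easier to think of test cases: empty clusters, identical points, or a different number of clusters than the data really has.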

Check out MITx's Computational Thinking using Python XSeries Program
This online learning program comes with 2 courses:

I personally took this for about 3-4 months and it was worth it. Taking these courses will help you understand the data science/statistics terms that are commonly used in Big Data/AI projects.

4. Background knowledge on Data Services

Again, this depends on what your company will use.
Getting to know your data services will definitely help you out in creating and expanding your test plan.
For example: What if the storage path is missing or not configured? What if there is no new file? What if the location exists but has no data at all? What if I upload a bad/corrupted data file?
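Those "what if" questions translate nicely into test cases. The sketch below assumes a local folder stands in for your storage path and that the data files are JSON; the `check_landing_path` function and its status strings are hypothetical, so adapt them to whatever data service your project uses.

```python
import json
import tempfile
from pathlib import Path


def check_landing_path(path):
    """Return a short status string describing the state of a landing folder."""
    path = Path(path)
    if not path.exists():
        return "missing path"
    files = sorted(f for f in path.iterdir() if f.is_file())
    if not files:
        return "no files"
    for data_file in files:
        text = data_file.read_text()
        if not text.strip():
            return f"empty file: {data_file.name}"
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return f"corrupted file: {data_file.name}"
    return "ok"


# Walk through each scenario inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    results = {"missing": check_landing_path(root / "does_not_exist")}

    landing = root / "landing"
    landing.mkdir()
    results["no_files"] = check_landing_path(landing)

    (landing / "data.json").write_text("")
    results["empty"] = check_landing_path(landing)

    (landing / "data.json").write_text("{not valid json")
    results["corrupted"] = check_landing_path(landing)

    (landing / "data.json").write_text('{"id": 1}')
    results["ok"] = check_landing_path(landing)

print(results)
```

Each scenario becomes one row in your test plan: set up the storage state, run the pipeline (or the check), and verify that it fails loudly instead of silently producing wrong data.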

5. Big Data Warehouses and Other Tools

Hadoop, Hive, Apache Spark, Apache Airflow and more.
As a QA, you don't need to master them all. All you need to know is how to navigate and monitor these tools for your testing.

In Summary ...

There is a lot to learn. It will be overwhelming at first but eventually it will be your new normal.

If there is one piece of advice for (upcoming) Big Data QAs, it is this: "Never be afraid to ask your developers". Collaboration and communication are always the key.


Got QA or Automation questions? Feel free to reach out. I'm up for collaboration! 🙂
