AJ_Coding

Posted on Apr 4, 2023

Github Guide for Data Scientists

#github #git #datascience

Source: Git Organized: A Better Git Flow | Render

INTRODUCTION

Git is defined as a free and open source distributed version control system designed to handle both small and large projects with speed and efficiency, according to the official Git website.

There are 2 types of Version Control Systems (VCS):

· Centralized VCS

· Distributed VCS

A centralized VCS (CVCS) uses a client-server model in which all team members access a central repository to store and manage changes to files. The client software lets users check out a working copy of the files to their computer, make changes, and commit those changes back to the central repository. The downsides of a CVCS are slower performance with large repositories and no access to the repository when the server is down.

On the other hand, a distributed VCS (DVCS) gives every team member a complete copy of the remote repository, including its full history of changes, in their local repository. This lets users work on their files independently from anywhere without relying on a central repository to record changes. Local repositories from different team members can be merged, which enables collaboration on the same files and simultaneous work on different sections of the project. For data scientists, a DVCS such as Git, paired with a hosting service such as GitHub, is essential to daily work, as the upcoming sections demonstrate.

Download and Setup

First, download Git from the official website mentioned above, choosing the installer for your operating system. Afterwards, head over to github.com and sign up for an account. Optionally, you can download the GitHub Desktop application if you prefer a more visual approach to commits, pushes, merges, etc.

However, for the purpose of this article we will learn how to use Git commands in the Command Line Interface (CLI). Understanding Git at this basic level is crucial and will in turn make using the desktop application a breeze. To confirm that Git has been installed successfully, run the command below in a CLI such as your terminal or Git Bash:

$ git --version

If everything was set up correctly, the command returns the installed version of Git. Next, we need to configure our name and email so that Git can identify us as the author of our commits. Run the commands below, replacing “your name” and “name@email.com” with your name and email address respectively.

$ git config --global user.name "your name"
$ git config --global user.email "name@email.com"
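
To check what Git has stored, the values can be read back with “git config”. The sketch below is illustrative only: it uses a throwaway repository and the “--local” scope (an assumption, chosen so the example does not touch your real per-user settings), with “Demo User” and “demo@example.com” as placeholder values.

```shell
# Work in a throwaway repository so nothing outside it is modified
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# --local writes to this repository's .git/config instead of the per-user file
git config --local user.name "Demo User"
git config --local user.email "demo@example.com"

# Read the values back; these are what Git records on commits made here
configured_name="$(git config user.name)"
configured_email="$(git config user.email)"
echo "$configured_name <$configured_email>"
```

With “--global” in place of “--local”, the same two settings apply to every repository for the current user.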

Creating a repository

A repository or repo for short is a place where you can store and manage the code for a project. It generally contains all of the project’s code, documentation, and other files. Collaboration on a project, tracking changes over time, and sharing work with others are all made simple by repositories.

There are 2 types of repositories:

· Local Repository: A copy of a repository stored on your computer’s hard drive where you can work on the local version of your project. You’ll be able to work on your project without affecting the main branch or the changes made by others. The local repo enables the user to make changes on the project, create branches and test those changes before pushing them to the remote repository for others to review and merge.

· Remote Repository: A copy of a repository hosted on a remote server or cloud service, such as GitHub. With the aid of remote repos, you may work on a project with others, publish your code online, and back up your data. You sync the two by creating a remote repo on GitHub and pushing your local repository to it, which lets you collaborate on project changes and share your code with others.

When working with Git and GitHub, you will utilize a blend of local and remote repositories. To edit the code, test your changes, and save your work, make use of your local repository. After that, you may share your work with others and work with your team by pushing your changes to a remote repository on GitHub.

Source: Author

To create our first repo, we will head over to our github.com profile page, click on the + sign at the top right of the page, then “New Repository.”

Afterwards, you can follow these steps to complete the creation of the repository:

· Give your repository a name. The name should be descriptive and reflect the goal of your project.

· You may optionally include a repository description. This can aid people in comprehending the purpose of your work.

· Choose whether you want your repository to be private or public.

· Decide whether or not to include a README file when starting a repository. If you want to give some fundamental information about your project, this is a fantastic option.

· Choose a licence for your project if you wish. A licence sets out the conditions under which others may use and change your code.

· Click “Create Repository”.

Source: Author

Working Directory

The working directory is where our project files live before they are pushed to the remote repository on GitHub. For illustration purposes, we will create a folder called “First” and add a CSV file called “repo” inside it. We will use Git Bash for our commands. You can use the Git Bash terminal that was installed with Git, or Git Bash inside the VS Code editor. Make sure to navigate to the project folder in Git Bash before running the command. The command below creates an empty file called “repo.csv” in your working directory.

$ touch repo.csv

Next, we will need to initialize our repository in our working directory. To do so, we will use the Git command below. The command also creates a subdirectory “.git” that is typically hidden.

$ git init
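
A quick way to see what “git init” did is to look for the “.git” subdirectory. This is a minimal sketch, assuming a temporary folder stands in for the “First” project folder:

```shell
# A temporary folder stands in for the "First" project folder
project_dir="$(mktemp -d)"
cd "$project_dir"
touch repo.csv
git init -q

# -a lists hidden entries too, so the .git subdirectory becomes visible
ls -a

# Prints "true" when the current directory is inside a Git working tree
inside_work_tree="$(git rev-parse --is-inside-work-tree)"
echo "$inside_work_tree"
```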

An alternative to this first step is to clone the remote repository that we created on GitHub using the command below:

$ git clone "Repository URL"

Source: Author

This will create a local copy of the remote repository in your working directory. One of the Git commands that we will use quite often is “git status”. This command tells us which files are untracked, modified, staged, or in conflict. It shows the current state of the working directory and staging area relative to the last commit. When we run “git status” after initializing the repository and creating the CSV file “repo.csv”, we should get the output below in our terminal.

On branch main
No commits yet
Untracked files:
(use "git add <file>…" to include in what will be committed)
repo.csv

This basically tells us that Git is aware there is a file in the working directory known as “repo.csv” but it is not in the staging area yet and has not been committed. To track the file, we will use “git add.”
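
For scripts or a quick glance, “git status” also has a compact form. The sketch below assumes a fresh throwaway repository containing only “repo.csv”; in the short format, “??” marks an untracked file.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
touch repo.csv

# --short prints one line per path, prefixed with a two-character status code
status_output="$(git status --short)"
echo "$status_output"
```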

Staging Area

This is also known as the index. It is the intermediary step between the working directory and the local repository. For ease of understanding, the image below demonstrates the basic Git workflow and commands.

Source: ByteByteGo

The staging area allows you to review and choose which changes will be included in the next commit to the local Git repository. To add files to the staging area, use the “git add” command. With a period (.), all new and modified files in the current directory and its subfolders are added to the staging area, apart from ignored files. If you’re working with many files and want to be selective, give the filename after the command.

$ git add .
OR
$ git add repo.csv

After we add the file to the staging area using the command above, we can run “git status” again.

On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>…" to unstage)
new file: repo.csv

Now Git is tracking our new file as it has been added to the staging area. If we want to unstage the file, we can use “git rm --cached repo.csv”.
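
The round trip between untracked and staged can be sketched as follows, again in a throwaway repository. In the short status format, “A” in the first column marks a file newly added to the staging area; the variable names are only for illustration.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
touch repo.csv

# Stage the file, then record how Git reports it
git add repo.csv
staged_status="$(git status --short)"

# Remove the file from the staging area only; the file on disk is untouched
git rm -q --cached repo.csv
unstaged_status="$(git status --short)"

echo "after add:         $staged_status"
echo "after rm --cached: $unstaged_status"
```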

Local Repository

Next, we will need to commit our file in the staging area to the local repository. A commit is a snapshot of the changes made to files and folders in a repo. To save changes you make to files in your repository as a new version of your project’s history, you must commit the changes.

Before we perform our first commit, it is good practice to commit only the files that are needed to build and run our project. Unnecessary files such as log.txt don’t need to be uploaded to our repository. To achieve this, we create a .gitignore file and list in it the files that we don’t want included in our commits.

$ touch .gitignore

We can also create a log.txt file and list it in our .gitignore. One way to do that is to open .gitignore in VS Code, type “log.txt”, and save.
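
The same edit can be made from the command line, and “git check-ignore” reports whether a given path matches an ignore rule. This is a minimal sketch in a throwaway repository; the variable names are illustrative.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
touch repo.csv log.txt

# .gitignore holds one pattern per line; append the file name to it
echo "log.txt" >> .gitignore

# check-ignore exits with status 0 when the path matches an ignore rule
if git check-ignore -q log.txt; then log_ignored=yes; else log_ignored=no; fi
if git check-ignore -q repo.csv; then csv_ignored=yes; else csv_ignored=no; fi
echo "log.txt ignored: $log_ignored, repo.csv ignored: $csv_ignored"
```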

Source: Author

From this image, we can see that the “.gitignore” file is untracked by Git as shown by the “U”, “repo.csv” is added (A) to the staging area, and “log.txt” is ignored by Git. Let’s run “git status” to confirm this.

On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>…" to unstage)
new file: repo.csv
Untracked files:
(use "git add <file>…" to include in what will be committed)
.gitignore

Based on the above, we first need to add the .gitignore file to the staging area by running “git add .” before committing our changes to the local repository.

After adding all eligible files to the staging area and running “git status” again, we get the outcome below.

On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>…" to unstage)
new file: .gitignore
new file: repo.csv

We can use the Git command “git ls-files” to confirm which files are currently in our staging area.

$ git ls-files
.gitignore
repo.csv

As confirmed, “log.txt” is ignored by Git. The next step is to commit our changes to our local repository. To do so, we’ll run the command below:

$ git commit -m "First Commit"

As a good practice, the commit message in the quotes should be written in the present tense (by convention the imperative mood, as in “Add new dataset”) and should concisely summarize the changes that we are committing to the repository.

$ git commit -m "First Commit"
[main (root-commit) deed287] First Commit
2 files changed, 2 insertions(+)
create mode 100644 .gitignore
create mode 100644 repo.csv

Running “git status” will give us a confirmation that the first commit has been done successfully.

$ git status
On branch main
nothing to commit, working tree clean
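
In a script, a clean working tree can be detected without parsing the human-readable message: “git status --porcelain” prints nothing when there is nothing to commit. This is a minimal sketch, assuming a throwaway repository and a placeholder identity (“Demo User”) set only for that repository:

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"

touch repo.csv
echo "log.txt" > .gitignore
git add .
git commit -q -m "First Commit"

# Empty output from --porcelain means the working tree is clean
pending_changes="$(git status --porcelain)"
# Number of commits reachable from the current branch tip
commit_count="$(git rev-list --count HEAD)"
echo "commits: $commit_count, pending changes: '$pending_changes'"
```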

To view our commit history, we can use the “git log” command. This Git command gives us a lot of details such as:

· Commit Hash, which is a unique identifier for the commit and is a 40-character hexadecimal string.

· Branch

· Name and Email Address of the author who made the commit.

· Date and Time when the commit was made

· Commit Message

Using the commit hash, we can return to a previous state of the project code using “git checkout <commit-hash>”, or restore a single file with “git restore --source <commit-hash> repo.csv”.
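
As a rough illustration of restoring a file from an earlier commit, the sketch below makes two commits to “repo.csv” in a throwaway repository and then brings back the first version. The file contents, commit messages and “Demo User” identity are made up for the example, and “git restore” needs Git 2.23 or later.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"

echo "id,value" > repo.csv
git add repo.csv
git commit -q -m "Add header row"
first_hash="$(git rev-parse HEAD)"          # full hash of the first commit

echo "1,42" >> repo.csv
git commit -q -am "Add first data row"

# One line per commit: abbreviated hash followed by the commit message
git log --oneline

# Bring back repo.csv as it was in the first commit
git restore --source "$first_hash" repo.csv
restored_content="$(cat repo.csv)"
echo "$restored_content"
```

After the restore, “git status” shows “repo.csv” as modified, because the working copy now differs from the latest commit.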

Remote Repository

Now that our project files have been committed to the local repository, the next step is to upload them to the remote repository on GitHub. We first copy the repository URL from the website and use the command below:

$ git remote add origin "Repository URL"

The “origin” is the name we are giving to the remote repository. This sets up a link between the local repository and the remote repository named origin. We can confirm the remote repository using:

$ git remote -v
origin https://github.com/AJ-Coding101/First.git (fetch)
origin https://github.com/AJ-Coding101/First.git (push)

Now that we have confirmed our remote repository, we can push our changes. The first time you push to the remote repo over HTTPS, you will be asked to authenticate, typically with your GitHub username and a personal access token, since GitHub no longer accepts account passwords for Git operations. To perform the push we use the command below:

$ git push -u origin main

“origin” is the name we gave to our remote repository and “main” is the branch we are pushing to. The “-u” flag is short for “--set-upstream”: it makes the local main branch track the main branch on origin. This is a shortcut that saves us from specifying the remote and branch each time we want to pull or push in the future.

$ git push -u origin main
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (4/4), 271 bytes | 271.00 KiB/s, done.
Total 4 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/AJ-Coding101/First.git
* [new branch] main -> main
branch 'main' set up to track 'origin/main'.
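
The same sequence can be tried without a GitHub account by letting a local bare repository stand in for the remote. This is only a sketch: the bare repository, the temporary paths and the “Demo User” identity are assumptions for illustration, and a real GitHub URL would take the place of “$remote_dir”.

```shell
remote_dir="$(mktemp -d)"
work_dir="$(mktemp -d)"
git init -q --bare "$remote_dir"            # stand-in for the GitHub repository

cd "$work_dir"
git init -q
git symbolic-ref HEAD refs/heads/main       # name the initial branch "main"
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"
touch repo.csv
git add repo.csv
git commit -q -m "First Commit"

# Link the local repository to the remote, then push and set the upstream
git remote add origin "$remote_dir"
git push -q -u origin main

# The upstream recorded by -u, and the commit now stored in the remote
upstream="$(git rev-parse --abbrev-ref 'main@{upstream}')"
local_head="$(git rev-parse HEAD)"
remote_head="$(git --git-dir="$remote_dir" rev-parse refs/heads/main)"
echo "main tracks $upstream"
```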

Below is what our GitHub repository page looks like after performing the push:

Source: Author

We can see that our dataset “repo.csv” is now in the cloud. The “log.txt” file, even though it was in our working directory, does not appear on the main page of our repository because it is listed in the “.gitignore” file and was therefore never committed or pushed.

We can also see that we have made only 1 commit so far and we have 1 branch which is our default.

Source: Author

Branches

A branch is a reference to a particular commit in the repository’s history. By creating a branch, you can work on a new feature or bug fix apart from the main codebase. This keeps the original code unchanged while you work. Once you’ve made changes and tested them, the branch can be merged back into the main code.

Source: Gitbookdown

To create a new branch, we can use the command below:

$ git branch second

Here “second” is the name of our new branch. To switch to the new branch we can use:

$ git checkout second
Switched to branch 'second'

Just to ensure that the branch exists, the following command will help:

$ git branch or $ git branch -l
main
* second

The asterisk (*) indicates the branch that we are currently working on.

$ git ls-files
.gitignore
repo.csv

Using the command above we can see that in the “second” branch, we still have our original files present. Let’s make some changes to our new branch.


$ touch dataset.csv

Source: Author

$ git status
On branch second
Untracked files:
(use "git add <file>…" to include in what will be committed)
dataset.csv
nothing added to commit but untracked files present (use "git add" to track)

We can see that we have a new untracked csv file “dataset” in our new branch. Let’s add the csv file to our staging area and commit it to our local repository.

$ git add dataset.csv
$ git commit -m "Add new dataset"
[second 00695d1] Add new dataset
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 dataset.csv

We can now view what files are present in our new branch as shown below:

$ git ls-files
.gitignore
dataset.csv
repo.csv

Let’s switch back to our main branch and view the files present.

$ git checkout main
$ git ls-files
.gitignore
repo.csv

The new file “dataset.csv” is not present in our main branch, as we can confirm from the command above. This shows that we can edit files and make commits on a separate branch without affecting the main codebase.
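
The isolation between branches can be reproduced end to end in a throwaway repository. This is a minimal sketch, with a placeholder “Demo User” identity and illustrative variable names:

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main       # name the initial branch "main"
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"
touch repo.csv
git add repo.csv
git commit -q -m "First Commit"

# Create the branch, switch to it, and commit a new file there
git branch second
git checkout -q second
touch dataset.csv
git add dataset.csv
git commit -q -m "Add new dataset"
files_on_second="$(git ls-files)"

# Back on main, the new file is neither tracked nor present on disk
git checkout -q main
files_on_main="$(git ls-files)"
echo "second: $(echo "$files_on_second" | tr '\n' ' ')"
echo "main:   $files_on_main"
```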

We can merge our changes in the “second” branch to the “main” branch using the command below.

$ git merge second
Updating deed287..00695d1
Fast-forward
dataset.csv | 0
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 dataset.csv
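
A fast-forward merge simply moves “main” forward to the commit that “second” points to, which can be checked by comparing the two branch tips. The sketch below assumes a throwaway repository and a placeholder “Demo User” identity; “git checkout -b” creates and switches to the branch in one step.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main       # name the initial branch "main"
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"
touch repo.csv
git add repo.csv
git commit -q -m "First Commit"

git checkout -q -b second                   # create and switch in one step
touch dataset.csv
git add dataset.csv
git commit -q -m "Add new dataset"

# Merge from the branch that should receive the changes
git checkout -q main
git merge -q second

main_tip="$(git rev-parse main)"
second_tip="$(git rev-parse second)"
files_on_main="$(git ls-files)"
echo "$files_on_main"
```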

Once we are done with a branch we can easily delete it using:

$ git branch -d second

If the branch was also pushed to the remote repository, we can delete it there using:

$ git push origin --delete second
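
Both deletions can be rehearsed against a local bare repository that stands in for GitHub. This is a sketch under that assumption, with temporary paths and a placeholder “Demo User” identity; note that “git branch -d” refuses to delete a branch whose commits have not been merged.

```shell
remote_dir="$(mktemp -d)"
work_dir="$(mktemp -d)"
git init -q --bare "$remote_dir"            # stand-in for the GitHub repository

cd "$work_dir"
git init -q
git symbolic-ref HEAD refs/heads/main       # name the initial branch "main"
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"
touch repo.csv
git add repo.csv
git commit -q -m "First Commit"
git remote add origin "$remote_dir"

# Create the branch, commit on it, and publish it to the remote
git checkout -q -b second
touch dataset.csv
git add dataset.csv
git commit -q -m "Add new dataset"
git push -q -u origin second

# Merge into main, then remove the branch locally and on the remote
git checkout -q main
git merge -q second
git branch -q -d second
git push -q origin --delete second

local_match="$(git branch --list second)"
remote_match="$(git ls-remote --heads origin second)"
echo "local: '$local_match' remote: '$remote_match'"
```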

Source: Author

Pull Requests

A pull request is a request to merge changes made in a branch into another branch, typically the main branch.

The process of making a pull request involves creating a new branch to contain the changes, making the changes and committing them to the branch, and then submitting a pull request to the destination branch. The pull request can then be reviewed by other contributors and the repository owner before the changes are merged into the destination branch.

It is important to note that the branch containing the changes must be pushed to the remote repository on GitHub before a pull request can be created. This allows the other contributors to access the changes and review the pull request.

Here are the basic steps for creating a pull request on GitHub:

· Create a branch: Create a new branch using the method we described earlier.

· Make changes: Make changes to the code, add files, or make other modifications.

· Commit changes: Use the “git commit” command to commit the changes to the branch.

· Push the branch: Use the “git push” command to push the branch to the remote repository on GitHub.

· Create the pull request: Go to the GitHub repository and click on the “New pull request” button. Select the source and destination branches and create the pull request.

· Review and merge: The pull request can now be reviewed by other contributors and the repository owner. Once approved, the changes can be merged into the destination branch.
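
The command-line part of these steps (branch, change, commit, push) can be sketched as below; opening and reviewing the pull request itself happens on the GitHub website. The example is hypothetical throughout: a local bare repository stands in for GitHub, and the branch name “add-dataset” and the “Demo User” identity are placeholders.

```shell
remote_dir="$(mktemp -d)"
work_dir="$(mktemp -d)"
git init -q --bare "$remote_dir"            # stand-in for the GitHub repository

cd "$work_dir"
git init -q
git symbolic-ref HEAD refs/heads/main       # name the initial branch "main"
git config user.name "Demo User"            # placeholder identity for the demo
git config user.email "demo@example.com"
touch repo.csv
git add repo.csv
git commit -q -m "First Commit"
git remote add origin "$remote_dir"
git push -q -u origin main

# Create a branch, make a change, commit it, and push the branch
git checkout -q -b add-dataset
touch dataset.csv
git add dataset.csv
git commit -q -m "Add new dataset"
git push -q -u origin add-dataset

# Commits a pull request from add-dataset into main would propose
proposed_commits="$(git log --format=%s origin/main..origin/add-dataset)"
echo "$proposed_commits"
```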

CONCLUSION

GitHub is a powerful tool for version control and collaboration that can greatly benefit data scientists in particular. Data science involves working with complex datasets and code, often in teams, and GitHub provides an efficient way to manage and track changes to these files. By using Git and GitHub, data scientists can easily collaborate with their colleagues and keep track of changes to their code and datasets.
