Reading an ARFF File into a Pandas DataFrame

Attribute-Relation File Format (ARFF) is a file format developed by the Machine Learning Project at the Department of Computer Science of the University of Waikato, New Zealand. ARFF files are mostly used with WEKA (Waikato Environment for Knowledge Analysis), a collection of machine learning and data analysis tools released as free software under the GNU General Public License.

In this article, we will see how we can convert an ARFF file into a Pandas data frame.

Prerequisites:

We will be using two modules here: pandas and SciPy.

To install them, execute the following commands:

pip install pandas
pip install scipy

Approach 1: Using Pandas and SciPy

Step – 1

After the installation of the required modules, we will import them.

Python3




import pandas as pd
from scipy.io import arff


We will use the loadarff() function of the arff submodule of scipy.io. You can import the function directly at the beginning (from scipy.io.arff import loadarff), or import just the arff submodule, as above, and call arff.loadarff() when needed.

Step – 2

Download an ARFF file (this article uses cpu.arff) from the official WEKA website. Keeping it in the same directory as the Python file lets you refer to it by file name alone; the code below uses the path /content/cpu.arff, so adjust the path to wherever you saved the file. We will now use loadarff() to read the downloaded file and store the result in a variable.
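A wrong path is a common reason for the load step to fail, so it can help to check the location first. The snippet below is a minimal sketch using only the standard library; resolve_arff_path is a hypothetical helper name introduced for illustration and is not part of pandas or SciPy.

```python
from pathlib import Path


def resolve_arff_path(filename, base_dir="."):
    """Return the absolute path of an ARFF file, or raise if it is missing.

    Hypothetical helper for illustration only.
    """
    path = (Path(base_dir) / filename).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"ARFF file not found: {path}")
    return path
```

Under these assumptions, str(resolve_arff_path('cpu.arff')) could be passed to loadarff() in place of a hard-coded path.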

Python3




# Load the ARFF file; loadarff() returns a (data, meta) tuple
arff_file = arff.loadarff('/content/cpu.arff')


Step – 3

Now we will use the DataFrame() constructor of the pandas library to convert the loaded ARFF data into a pandas DataFrame.

Python3




df = pd.DataFrame(arff_file[0])


Here, inside the DataFrame() constructor, we pass the variable in which we stored the result of loadarff(). loadarff() returns a tuple of two items, the data (a NumPy record array) and the metadata (attribute names and types), so the index [0] selects the data, which is then converted into a Pandas DataFrame.
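To make the role of the index [0] concrete, the sketch below unpacks the two items returned by loadarff() and inspects the metadata. It reads a small in-memory sample instead of cpu.arff so that it is self-contained; the sample content and the relation name cpu_sample are assumptions made for illustration.

```python
from io import StringIO

from scipy.io import arff

# Hypothetical miniature ARFF content, used only to keep the example self-contained.
SAMPLE_ARFF = """@relation cpu_sample
@attribute MYCT numeric
@attribute MMIN numeric
@data
125,256
29,8000
"""

# loadarff() returns a (data, meta) tuple; arff_file[0] in the article is `data`.
data, meta = arff.loadarff(StringIO(SAMPLE_ARFF))

print(meta.name)         # relation name declared in the header
print(meta.names())      # attribute (column) names
print(meta.types())      # attribute types, such as 'numeric' or 'nominal'
print(data.dtype.names)  # field names of the NumPy record array
```

Unpacking into two names is equivalent to indexing the tuple, and keeps the metadata available for later use.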

Step – 4

Now we will use common pandas methods such as head() and tail() to check whether the ARFF file has been converted into a DataFrame successfully.

Python3




df.head()


Output:

    MYCT    MMIN     MMAX   CACH  CHMIN  CHMAX  class
0  125.0   256.0   6000.0  256.0   16.0  128.0  198.0
1   29.0  8000.0  32000.0   32.0    8.0   32.0  269.0
2   29.0  8000.0  32000.0   32.0    8.0   32.0  220.0
3   29.0  8000.0  32000.0   32.0    8.0   32.0  172.0
4   29.0  8000.0  16000.0   32.0    8.0   16.0  132.0

Python3




df.tail()


Output:

      MYCT    MMIN    MMAX  CACH  CHMIN  CHMAX  class
204  124.0  1000.0  8000.0   0.0    1.0    8.0   42.0
205   98.0  1000.0  8000.0  32.0    2.0    8.0   46.0
206  125.0  2000.0  8000.0   0.0    2.0   14.0   52.0
207  480.0   512.0  8000.0  32.0    0.0    0.0   67.0
208  480.0  1000.0  4000.0   0.0    0.0    0.0   45.0

Python3




df['MYCT'].head(20)


Output:

0     125.0
1      29.0
2      29.0
3      29.0
4      29.0
5      26.0
6      23.0
7      23.0
8      23.0
9      23.0
10    400.0
11    400.0
12     60.0
13     50.0
14    350.0
15    200.0
16    167.0
17    143.0
18    143.0
19    110.0
Name: MYCT, dtype: float64
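The output above shows only numeric columns. ARFF files can also declare nominal attributes, and SciPy returns those values as bytes rather than ordinary strings. The following is a minimal sketch of one way to decode them, again using a hypothetical in-memory sample rather than cpu.arff; it assumes the file content is UTF-8 compatible.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical sample with one nominal attribute (color).
SAMPLE_ARFF = """@relation shapes_sample
@attribute width numeric
@attribute height numeric
@attribute color {red,green,blue}
@data
5.0,3.25,blue
4.5,3.75,green
3.0,4.00,red
"""

data, meta = arff.loadarff(StringIO(SAMPLE_ARFF))
df_sample = pd.DataFrame(data)

# Nominal values arrive as bytes (for example b'blue'); decode them to str.
for name, kind in zip(meta.names(), meta.types()):
    if kind == "nominal":
        df_sample[name] = df_sample[name].str.decode("utf-8")

print(df_sample)
```

Using the metadata to pick the columns avoids touching the numeric ones, which are already floats.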

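Approach 2: Using liac-arff and Pandas

We can use the liac-arff module alongside Pandas to import and convert an ARFF file into a Pandas DataFrame. The package is installed as liac-arff but imported as arff, the same name used for the scipy.io submodule in Approach 1, so run the two approaches in separate sessions or alias one of the imports. Install the required module first by executing the following command: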

pip install liac-arff

Step – 1

After installing the required modules, we will import them.

Python3




import arff
import pandas as pd


Step – 2

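After importing the required modules, we will open the ARFF file and parse it with the load() function of the arff module. Unlike SciPy, liac-arff does not provide a loadarff() function; load() takes an open file object and returns a dictionary.

Python3




with open('/content/cpu.arff') as f:
    dataset = arff.load(f)


Here, the variable dataset holds a dictionary with the keys 'description', 'relation', 'attributes' and 'data'. The rows are stored under 'data' as a list of lists, and 'attributes' is a list of (name, type) pairs.

Step – 3

After that, convert the data to a Pandas DataFrame, taking the column names from the attribute list.

Python3




columns = [name for name, _ in dataset['attributes']]
df = pd.DataFrame(dataset['data'], columns=columns)


Here, the rows in dataset['data'] are converted to a DataFrame whose columns carry the attribute names.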

Step – 4

Finally, we will print the data frame to see if it is working properly or not.

Python3




print(df)


Output:

      MYCT    MMIN     MMAX   CACH  CHMIN  CHMAX  class
0    125.0   256.0   6000.0  256.0   16.0  128.0  198.0
1     29.0  8000.0  32000.0   32.0    8.0   32.0  269.0
2     29.0  8000.0  32000.0   32.0    8.0   32.0  220.0
3     29.0  8000.0  32000.0   32.0    8.0   32.0  172.0
4     29.0  8000.0  16000.0   32.0    8.0   16.0  132.0
..     ...     ...      ...    ...    ...    ...    ...
204  124.0  1000.0   8000.0    0.0    1.0    8.0   42.0
205   98.0  1000.0   8000.0   32.0    2.0    8.0   46.0
206  125.0  2000.0   8000.0    0.0    2.0   14.0   52.0
207  480.0   512.0   8000.0   32.0    0.0    0.0   67.0
208  480.0  1000.0   4000.0    0.0    0.0    0.0   45.0

[209 rows x 7 columns]

Conclusion

In this article, we saw two approaches for reading a file with the .arff extension and converting it into a Pandas DataFrame. Some users may prefer the approach based on Pandas and SciPy, whereas others may prefer liac-arff. The benefit of converting an ARFF file into a Pandas DataFrame is that it opens up many ways to manipulate the information stored in the file, and it makes cleaning and sorting the data easier.


