Input CSV File Format

Structure your training and prediction CSV files for optimal results

This guide explains how to format CSV files so they can be used for dataset creation, training, and inference within the Butterfly AI predictive platform.

File format overview

Accepted format: .csv
Maximum file size: 500 MB
Minimum recommended rows: Several hundred
Delimiter: Comma (,)

Input CSV creation guide

Continue reading these guidelines, or watch the video walkthrough of what the CSV data should look like.

Training CSV Format

The structure of the training file must follow this format:

First row: Header with the name of each column
First column: Unique ID (e.g., sample ID, row identifier)
Last column: Target/label. Binary or multi-class; can be numeric (0, 1) or text (N, Y, Class1, Class4, …)
Middle columns: Feature data (numeric or string values)

ID,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Target
1000025,5,A,1.767,1,Low,1,-3,1,1,N
1002945,5,A,4.367,5,Medium,10,3,2,1,P
1015425,3,B,1.98,1,Low,2,-7,1,1,N
...

Additional requirements for best results:

The smallest class represented in the CSV file should have at least 25 rows (labelled samples)
Although files up to 500 MB are supported, we recommend no more than 250,000 rows per training CSV in the current release. Larger files are accepted, but processing may be significantly slower
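
The structural rules and size recommendations above can be checked before upload. The following is a minimal sketch, not part of the platform: the function name `check_training_csv` and the two constants are hypothetical, and the thresholds are simply the numbers quoted in this guide.

```python
import csv
import io
from collections import Counter

MIN_ROWS_PER_CLASS = 25          # smallest-class recommendation from this guide
MAX_RECOMMENDED_ROWS = 250_000   # row-count recommendation from this guide


def check_training_csv(text: str) -> list:
    """Return a list of human-readable problems found in a training CSV."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return ["file needs a header row and at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []

    # Every data row should have as many cells as the header (header is row 1).
    for i, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"row {i}: expected {len(header)} cells, got {len(row)}")

    # The first column must hold a unique ID.
    ids = [row[0] for row in data if row]
    if len(set(ids)) != len(ids):
        problems.append("first column contains duplicate IDs")

    # The last column is the target; count labelled rows per class.
    counts = Counter(row[-1] for row in data if len(row) == len(header))
    for label, n in counts.items():
        if n < MIN_ROWS_PER_CLASS:
            problems.append(f"class {label!r} has only {n} rows (< {MIN_ROWS_PER_CLASS})")

    if len(data) > MAX_RECOMMENDED_ROWS:
        problems.append(f"{len(data)} rows exceeds the recommended {MAX_RECOMMENDED_ROWS}")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the platform will accept the file.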

💡

It is highly recommended to shuffle the rows randomly to improve training performance

Prediction CSV format

Your prediction CSV should have exactly the same format as your training data, except without the final target column. For example:

ID,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8
3359991,10,D,1.565,2,High,6,-5,3,5
5618561,3,A,12.3,3,Low,11,-6,2,1
7199078,2,C,8.3,1,High,1,1,4,5

💡

This file is also referred to as blind data or unseen data
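
One way to confirm that a prediction file matches its training file is to compare the two header rows. This is a sketch under the assumption that "same format" means identical column names in the same order; `read_header` and `prediction_header_matches` are hypothetical helper names.

```python
import csv
import io


def read_header(text: str) -> list:
    """Return the first row of a CSV given as a string, with names stripped."""
    return [name.strip() for name in next(csv.reader(io.StringIO(text)))]


def prediction_header_matches(training_text: str, prediction_text: str) -> bool:
    """True if the prediction header equals the training header minus its last (target) column."""
    return read_header(prediction_text) == read_header(training_text)[:-1]
```

For example, a training header of `ID,Feature0,Feature1,Target` matches a prediction header of `ID,Feature0,Feature1`, but not one that still includes `Target`.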

Common pitfalls to avoid

Null or missing values

Do not use NULL in your CSV. Instead, leave the cell blank (i.e., ,,).
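
If an export tool has already written NULL markers, they can be blanked out in a preprocessing step. A minimal sketch follows; the helper name `blank_out_nulls` is hypothetical, and treating the marker case-insensitively (`null`, `Null`) is an assumption that goes beyond what this guide states.

```python
import csv
import io


def blank_out_nulls(text: str) -> str:
    """Return the CSV text with every NULL cell replaced by an empty cell."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Assumption: any casing of "NULL" counts as a null marker.
        writer.writerow(["" if cell.strip().upper() == "NULL" else cell for cell in row])
    return out.getvalue()
```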

Filenames with spaces

Avoid filenames like:

✗ Milk Quality.csv

Use instead:

✓ MilkQuality.csv
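
Renaming can be scripted when many files are involved. The sketch below only computes the new name (it does not touch the file system), and `without_spaces` is a hypothetical helper name.

```python
from pathlib import Path


def without_spaces(path: str) -> str:
    """Return the same path with spaces removed from the file name only."""
    p = Path(path)
    return str(p.with_name(p.name.replace(" ", "")))
```

Calling `without_spaces("Milk Quality.csv")` returns `MilkQuality.csv`.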

Multiple columns with unique identifiers

You must only have one ID-like column. If additional columns contain unique strings (like names), BAI may treat them as identifiers and reduce prediction quality:

ID,Feature0,...,Feature3,...,Target
1000025,5,...,"Jack Smith",...,N

💡

Fix by removing or encoding the second identifier column
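
Extra identifier columns can be hard to spot by eye in a wide file. The heuristic below flags columns after the first whose values are all distinct, non-numeric strings. It is only an illustrative assumption about what "ID-like" means; how BAI itself detects identifiers is not documented here, and `id_like_columns` is a hypothetical name.

```python
import csv
import io


def is_number(cell: str) -> bool:
    """True if the cell parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def id_like_columns(text: str) -> list:
    """Names of non-first columns whose values are all distinct, non-numeric strings."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    suspects = []
    for idx, name in enumerate(header[1:], start=1):  # skip the real ID column
        values = [row[idx] for row in data if idx < len(row) and row[idx] != ""]
        all_distinct = len(values) > 1 and len(set(values)) == len(values)
        if all_distinct and not any(is_number(v) for v in values):
            suspects.append(name)
    return suspects
```

On very small files a legitimate categorical column can be flagged by chance, so treat the result as a list of columns to review rather than to delete automatically.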

Sorting by a single feature

Do not pre-sort the data by any feature column (e.g., by Feature2), as this will reduce model accuracy. Random row order is essential. Example of a poorly ordered file (sorted by Feature2):

ID,Feature0,Feature1,Feature2,...,Target
1017023,4,A,1.355,...,N
1035283,1,D,1.4,...,N
...

💡

Shuffle the dataset instead.
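
Shuffling must keep the header in place and move only the data rows. A minimal sketch using the Python standard library is shown here; `shuffle_rows` is a hypothetical helper, and the optional `seed` parameter is included only so that a shuffle can be reproduced.

```python
import csv
import io
import random
from typing import Optional


def shuffle_rows(text: str, seed: Optional[int] = None) -> str:
    """Return the CSV text with data rows in random order and the header first."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    random.Random(seed).shuffle(data)  # shuffles the data rows in place
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(data)
    return out.getvalue()
```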

Summary

First column as the unique ID
Last column (training only) as the target label (text or number). Omit this column in the prediction CSV
Middle columns as features (numeric or categorical)
Max file size is 500 MB
Recommended minimum row count: several hundred (mid to high hundreds)
No NULLs, leave cells blank instead
File names without spaces
Avoid sorted rows; shuffle them instead
Ensure only one unique ID column per file
No characters or text in numeric columns: if a column in the input CSV file is intended to be numeric, it should contain only numeric values
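
Stray text in an otherwise numeric column can be located with a simple scan. The sketch below reports columns in which fewer than half of the non-blank cells fail to parse as numbers; that threshold and the name `mixed_columns` are assumptions made for illustration, not platform behavior.

```python
import csv
import io


def _is_number(cell: str) -> bool:
    """True if the cell parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def mixed_columns(text: str) -> dict:
    """Map column name -> non-numeric cells, for columns that are mostly numeric."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    report = {}
    for idx, name in enumerate(header):
        # Blank cells are allowed (they represent missing values), so skip them.
        cells = [row[idx] for row in data if idx < len(row) and row[idx] != ""]
        bad = [c for c in cells if not _is_number(c)]
        if bad and len(bad) < len(cells) / 2:
            report[name] = bad
    return report
```

Fully categorical columns are not reported, because most or all of their cells are text by design.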

💡

By following these guidelines, you ensure maximum compatibility and performance from the BAI platform
