This guide outlines how to format CSV files correctly so they can be used for dataset creation, training, and inference within the Butterfly AI predictive platform.


File format overview

  • Accepted format: .csv
  • Maximum file size: 500 MB
  • Minimum recommended rows: Several hundred
  • Delimiter: Comma (,)
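
The limits above can be checked before uploading. The following is a minimal sketch, not part of the platform: `check_file` and `MAX_BYTES` are hypothetical names, and it assumes that "500 MB" means 500 × 1024 × 1024 bytes.

```python
import os

# Assumption: "500 MB" is interpreted as 500 * 1024 * 1024 bytes.
MAX_BYTES = 500 * 1024 * 1024


def check_file(path):
    """Return a list of problems with the file's extension and size."""
    problems = []
    if not path.lower().endswith(".csv"):
        problems.append("file extension is not .csv")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 500 MB")
    return problems
```

An empty list means the file passes both checks.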

Input CSV creation guide

Continue reading these guidelines or watch this video for a walkthrough of what CSV data should look like:

Training CSV format


The structure of the training file must follow this format:

  • First row: Name of each column (header row)
  • First column: Unique ID (e.g., sample ID, row identifier)
  • Last column: Target/label. Binary or multi-class; can be numeric (0, 1) or text (N, Y, Class1, Class4, …)
  • Middle columns: Feature data (numeric or string values)

For example:

  ID,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Target
1000025,5,A,1.767,1,Low,1,-3,1,1,N
1002945,5,A,4.367,5,Medium,10,3,2,1,P
1015425,3,B,1.98,1,Low,2,-7,1,1,N
...
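
The layout rules above can be verified with a short script. This is a minimal sketch using only the Python standard library; `check_training_layout` is a hypothetical helper, not a Butterfly AI function, and it only checks what this guide states (a header row, a consistent number of fields, and unique IDs in the first column).

```python
import csv
import io


def check_training_layout(csv_text):
    """Return a list of layout problems found in a training CSV given as text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return ["file needs a header row and at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []
    # ID + at least one feature + target means at least three columns.
    if len(header) < 3:
        problems.append("expected an ID column, at least one feature, and a target")
    # Every data row must have as many fields as the header.
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
    # The first column must identify each row uniquely.
    ids = [row[0] for row in data if row]
    if len(set(ids)) != len(ids):
        problems.append("first column contains duplicate IDs")
    return problems
```

To check a file on disk, read it with `open(path, newline="").read()` and pass the text to the function.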
  

Additional requirements for best results:

  • The smallest class represented in the CSV file should have at least 25 rows (labelled samples)
  • Although files up to 500 MB are supported, the recommendation for the current release is no more than 250,000 rows per training CSV. Larger files are accepted, but processing may be significantly slower
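
Both recommendations can be checked by counting rows per target value. The sketch below assumes the target is the last column, as described in this guide; `check_training_size` and the two constants are hypothetical names chosen for illustration.

```python
import csv
import io
from collections import Counter

MIN_ROWS_PER_CLASS = 25         # smallest class size recommended in this guide
RECOMMENDED_MAX_ROWS = 250_000  # recommended row limit for the current release


def check_training_size(csv_text):
    """Return (rows per class, warnings) for a training CSV given as text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    # Count how many rows carry each target value (last column).
    counts = Counter(row[-1] for row in reader if row)
    total = sum(counts.values())
    warnings = []
    for label, n in sorted(counts.items()):
        if n < MIN_ROWS_PER_CLASS:
            warnings.append(f"class {label!r} has only {n} rows")
    if total > RECOMMENDED_MAX_ROWS:
        warnings.append(
            f"{total} rows exceeds the recommended {RECOMMENDED_MAX_ROWS}; "
            "processing may be slow"
        )
    return counts, warnings
```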

Prediction CSV format


Your prediction CSV should have exactly the same format as your training data, except without the final target column. For example:

  ID,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8
3359991,10,D,1.565,2,High,6,-5,3,5
5618561,3,A,12.3,3,Low,11,-6,2,1
7199078,2,C,8.3,1,High,1,1,4,5
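
If labelled data is already in the training layout, a prediction file can be produced by removing the last column. The sketch below is illustrative only; `drop_target_column` and `check_prediction_header` are hypothetical helpers, and they assume the target is the last column.

```python
import csv
import io


def drop_target_column(csv_text):
    """Return the CSV text with the last column removed from every row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip completely empty lines
            writer.writerow(row[:-1])
    return out.getvalue()


def check_prediction_header(training_header, prediction_header):
    """True when the prediction header is the training header minus its target."""
    return prediction_header == training_header[:-1]
```

Comparing headers before running inference catches renamed, reordered, or missing feature columns early.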
  

Common pitfalls to avoid


Null or missing values

Do not use NULL in your CSV. Instead, leave the cell blank (i.e., ,,).
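
Exports from databases often write missing values as the literal text NULL. A minimal sketch of a cleanup step follows; `blank_out_nulls` is a hypothetical helper, and the set of markers it replaces is an assumption that should be adjusted to match your export.

```python
import csv
import io

# Assumption: these spellings mark missing values in the source export.
NULL_MARKERS = {"NULL", "null"}


def blank_out_nulls(csv_text):
    """Return the CSV text with NULL markers replaced by empty cells."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        writer.writerow(
            ["" if cell.strip() in NULL_MARKERS else cell for cell in row]
        )
    return out.getvalue()
```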


Filenames with spaces

Avoid filenames like:

Milk Quality.csv

Use instead:

MilkQuality.csv
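
Renaming can be scripted when many files are involved. This small sketch (the `remove_spaces` name is hypothetical) only computes the new path; pass the result to `os.rename` to apply it.

```python
import os


def remove_spaces(path):
    """Return the path with spaces removed from the file name only."""
    directory, name = os.path.split(path)
    return os.path.join(directory, name.replace(" ", ""))
```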


Multiple columns with unique identifiers

You must have only one ID-like column. If additional columns contain unique strings (like names), Butterfly AI (BAI) may treat them as identifiers, which reduces prediction quality:

  ID,Feature0,...,Feature3,...,Target
1000025,5,...,"Jack Smith",...,N
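
One way to spot such columns is to look for text columns in which no value repeats. The sketch below is a heuristic of our own, not the platform's detection logic; `find_id_like_columns` is a hypothetical helper, and it assumes every row has the same number of fields as the header.

```python
import csv
import io


def _is_number(value):
    """True when the value parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def find_id_like_columns(csv_text, has_target=True):
    """Return names of non-ID text columns whose values are all unique."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    # Skip the real ID column and, for training files, the target column.
    last = len(header) - 1 if has_target else len(header)
    suspects = []
    for index in range(1, last):
        values = [row[index] for row in data]
        has_text = any(v != "" and not _is_number(v) for v in values)
        if has_text and len(data) > 1 and len(set(values)) == len(values):
            suspects.append(header[index])
    return suspects
```

Columns reported here are candidates to drop or to recode into a category with fewer distinct values.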
  

Sorting by a single feature

Do not pre-sort the data by any feature column (e.g., by Feature2), as this will reduce model accuracy. Random row order is essential. Example of a poorly formatted file (sorted by Feature2):

  ID,Feature0,Feature1,Feature2,...,Target
1017023,4,A,1.355,...,N
1035283,1,D,1.4,...,N
...
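
A sorted file can be fixed by shuffling the data rows while keeping the header in place. This is a minimal sketch using the standard library; `shuffle_rows` is a hypothetical helper, and the optional `seed` argument only makes the result reproducible.

```python
import csv
import io
import random


def shuffle_rows(csv_text, seed=None):
    """Return the CSV text with data rows in random order, header first."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    random.Random(seed).shuffle(data)  # shuffles the data rows in place
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(data)
    return out.getvalue()
```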
  

Summary


  • First column as the unique ID
  • Last column (training only) as the target label (text or number). Omit it from the prediction CSV
  • Middle columns as features (numeric or categorical)
  • Max file size is 500 MB
  • Minimum recommended row count: mid to high hundreds
  • No NULLs, leave cells blank instead
  • File names without spaces
  • Do not sort rows; shuffle them instead
  • Ensure only one unique ID column per file
  • No characters or text in numeric columns: if a column in the input CSV file is intended to be numeric, every value in that column must be purely numeric
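
The last point can be checked by looking for columns that are mostly numeric but contain a few text values. The sketch below is a heuristic; `find_mixed_columns` is a hypothetical helper, and the 90% threshold in `numeric_share` is an assumption, not a platform rule.

```python
import csv
import io


def _is_number(value):
    """True when the value parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def find_mixed_columns(csv_text, numeric_share=0.9):
    """Map each mostly numeric column to the non-numeric values found in it."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, data = rows[0], rows[1:]
    mixed = {}
    for index, name in enumerate(header):
        # Blank cells are allowed (missing values), so ignore them here.
        values = [
            row[index] for row in data if index < len(row) and row[index] != ""
        ]
        bad = [v for v in values if not _is_number(v)]
        if values and bad and (len(values) - len(bad)) / len(values) >= numeric_share:
            mixed[name] = bad
    return mixed
```

Purely categorical columns are not reported, because almost none of their values parse as numbers.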