How To Upload Csv In Python And Manipulate

CSV (comma-separated value) files are a common file format for transferring and storing data. The ability to read, manipulate, and write data to and from CSV files using Python is a key skill to master for any data scientist or business analyst. In this post, we'll go over what CSV files are, how to read CSV files into Pandas DataFrames, and how to write DataFrames back to CSV files post analysis.

Pandas is the most popular data manipulation package in Python, and DataFrames are the Pandas data type for storing tabular 2D data.

  1. Load CSV files to Python Pandas
  2. 1. File Extensions and File Types
  3. 2. Data Representation in CSV files
    • Other Delimiters / Separators – TSV files
    • Delimiters in Text Fields – Quotechar
  4. 3. Python – Paths, Folders, Files
    • Finding your Python Path
    • File Loading: Absolute and Relative Paths
  5. 4. Pandas CSV File Loading Errors
  6. Advanced Read CSV Files
    • Specifying Data Types
    • Skipping and Picking Rows and Columns From File
    • Custom Missing Value Symbols
  7. CSV Format Advantages and Disadvantages
  8. Additional Reading

Load CSV files to Python Pandas

The basic process of loading data from a CSV file into a Pandas DataFrame (with all going well) is achieved using the "read_csv" function in Pandas:

    # Load the Pandas libraries with alias 'pd'
    import pandas as pd

    # Read data from file 'filename.csv'
    # (in the same directory that your python process is based)
    # Control delimiters, rows, column names with read_csv (see later)
    data = pd.read_csv("filename.csv")

    # Preview the first five lines of the loaded data
    data.head()

While this code seems simple, an understanding of the following fundamental concepts is required to fully grasp and debug the data loading procedure if you run into issues:

  1. Understanding file extensions and file types – what do the letters CSV actually mean? What's the difference between a .csv file and a .txt file?
  2. Understanding how data is represented inside CSV files – if you open a CSV file, what does the data really look like?
  3. Understanding the Python path and how to reference a file – what is the absolute and relative path to the file you are loading? What directory are you working in?
  4. CSV data formats and errors – common errors with the function.

Each of these topics is discussed below, and we end this tutorial by looking at some more advanced CSV loading mechanisms and giving some broad advantages and disadvantages of the CSV format.

1. File Extensions and File Types

The first step to working with comma-separated-value (CSV) files is understanding the concept of file types and file extensions.

  1. Data is stored on your computer in individual "files", or containers, each with a different name.
  2. Each file contains data of different types – the internals of a Word document are quite different from the internals of an image.
  3. Computers determine how to read files using the "file extension", that is, the code that follows the dot (".") in the filename.
  4. So, a filename is typically in the form "<random name>.<file extension>". Examples:
    • project1.DOCX – a Microsoft Word file called project1.
    • shanes_file.TXT – a simple text file called shanes_file.
    • IMG_5673.JPG – an image file called IMG_5673.
    • Other well known file types and extensions include: XLSX – Excel, PDF – Portable Document Format, PNG – images, ZIP – compressed file format, GIF – animation, MPEG – video, MP3 – music, etc.
  5. A CSV file is a file with a ".csv" file extension, e.g. "data.csv", "super_information.csv". The "CSV" in this case lets the computer know that the data contained in the file is in "comma separated value" format, which we'll discuss below.

File extensions are hidden by default on a lot of operating systems. The first step that any self-respecting engineer, software engineer, or data scientist will do on a new computer is to ensure that file extensions are shown in their Explorer (Windows) or Finder (Mac) windows.

Folder with file extensions showing. Before working with CSV files, ensure that you can see your file extensions in your operating system. Different file contents are denoted by the file extension, or letters after the dot, of the file name, e.g. TXT is text, DOCX is Microsoft Word, PNG are images, CSV is comma-separated value data.

To check if file extensions are showing in your system, create a new text document with Notepad (Windows) or TextEdit (Mac) and save it to a folder of your choice. If you can't see the ".txt" extension in your folder when you view it, you will have to change your settings.

  • In Microsoft Windows: Open Control Panel > Appearance and Personalization. Now, click on Folder Options, or File Explorer Options as it is now called, and open the View tab. In this tab, under Advanced Settings, you will see the option "Hide extensions for known file types". Uncheck this option and click on Apply and OK.
  • In Mac OS: Open Finder. In the menu, click Finder > Preferences, click Advanced, and select the checkbox for "Show all filename extensions".

2. Data Representation in CSV files

A "CSV" file, that is, a file with a "csv" filetype, is a basic text file. Any text editor, such as Notepad on Windows or TextEdit on Mac, can open a CSV file and show the contents. Sublime Text is a wonderful and multi-functional text editor option for any platform.

CSV is a standard for storing tabular data in text format, where commas are used to separate the different columns, and newlines (carriage return / press enter) are used to separate rows. Typically, the first row in a CSV file contains the names of the columns for the data.

An example table data set and the corresponding CSV-format data is shown in the diagram below.

The Pandas read_csv function is used to process this comma-separated file into tabular format in a Python DataFrame. Here we look at the innards of a CSV file to examine how columns are specified.
Comma-separated value files, or CSV files, are simple text files where commas and newlines are used to define tabular data in a structured way.

Note that almost any tabular data can be stored in CSV format – the format is popular because of its simplicity and flexibility. You can create a text file in a text editor, save it with a .csv extension, and open that file in Excel or Google Sheets to see the table form.
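To make the text-to-table relationship concrete, here is a minimal sketch that builds a small CSV string and loads it with read_csv. The column names and values are made up for illustration, and io.StringIO is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line is the header row,
# commas separate columns, and newlines separate rows.
csv_text = "name,age,city\nAlice,34,Dublin\nBob,28,Cork\n"

# io.StringIO lets read_csv treat an in-memory string like a file.
data = pd.read_csv(io.StringIO(csv_text))

print(data.columns.tolist())  # column names taken from the header row
print(data.shape)             # (number of rows, number of columns)
```

Saving the same text to a file with a .csv extension and passing the filename to read_csv should give the same DataFrame.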

Other Delimiters / Separators – TSV files

The comma separation scheme is by far the most popular method of storing tabular data in text files.

The choice of the ',' comma character to delimit columns, however, is arbitrary, and can be substituted where needed. Popular alternatives include tab ("\t") and semi-colon (";"). Tab-separated files are known as TSV (Tab-Separated Value) files.

When loading data with Pandas, the read_csv function is used for reading any delimited text file; the delimiter is changed using the sep parameter.
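As a small sketch of the sep parameter, the snippet below loads the same made-up table from a tab-separated string and from a semicolon-separated string. The data is hypothetical and held in memory purely for illustration.

```python
import io
import pandas as pd

# The same hypothetical table written with two different delimiters.
tsv_text = "name\tage\nAlice\t34\nBob\t28\n"
ssv_text = "name;age\nAlice;34\nBob;28\n"

# sep tells read_csv which character separates the columns.
tab_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")
semi_data = pd.read_csv(io.StringIO(ssv_text), sep=";")

# Both calls should produce identical DataFrames.
print(tab_data.equals(semi_data))
```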

Delimiters in Text Fields – Quotechar

One complication in creating CSV files arises if you have commas, semicolons, or tabs actually in one of the text fields that you want to store. In this case, it's important to use a "quote character" in the CSV file to create these fields.

The quote character can be specified in pandas.read_csv using the quotechar argument. By default (as with many systems), it's set as the standard quotation mark ("). Any commas (or other delimiters as demonstrated below) that occur between two quote characters will be ignored as column separators.

In the example shown, a semicolon-delimited file, with quotation marks as a quotechar, is loaded into Pandas and shown in Excel. The use of the quotechar allows the "NickName" column to contain semicolons without being split into more columns.

Other than commas in CSV files, tab-separated and semicolon-separated data is popular also. Quote characters are used if the data in a column may contain the separating character. In this case, the 'NickName' column contains semicolon characters, and so this column is "quoted". Specify the separator and quote character in pandas.read_csv.
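The following sketch reproduces the idea with made-up rows: a semicolon-delimited table whose NickName field itself contains a semicolon. Only the NickName column name comes from the example above; the other names and values are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data. The NickName values are wrapped
# in quotation marks, so the semicolon inside them is not a separator.
ssv_text = 'Name;NickName;Age\nRobert;"Bob; Bobby";41\nShane;"Shano";30\n'

data = pd.read_csv(io.StringIO(ssv_text), sep=";", quotechar='"')

print(data.shape)                # still three columns
print(data.loc[0, "NickName"])   # the quoted field keeps its semicolon
```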

3. Python – Paths, Folders, Files

When you specify a filename to pandas.read_csv, Python will look in your "current working directory". Your working directory is typically the directory that you started your Python process or Jupyter notebook from.

Pandas searches your 'current working directory' for the filename that you specify when opening or loading files. A FileNotFoundError can be due to a misspelled filename or an incorrect working directory.

Finding your Python Path

Your Python path can be displayed using the built-in os module. The os module provides operating-system-dependent functionality to Python programs and scripts.

To find your current working directory, the function required is os.getcwd(). The os.listdir() function can be used to display all files in a directory, which is a good check to see if the CSV file you are loading is in the directory as expected.

    # Find out your current working directory
    import os
    print(os.getcwd())
    # Out: /Users/shane/Documents/blog

    # Display all of the files found in your current working directory
    print(os.listdir(os.getcwd()))
    # Out: ['test_delimted.ssv', 'CSV Blog.ipynb', 'test_data.csv']

In the example above, my current working directory is the '/Users/shane/Documents/blog' directory. Any files that are placed in this directory will be immediately available to the Python open() function or the Pandas read_csv function.

Instead of moving the required data files to your working directory, you can also change your current working directory to the directory where the files reside using os.chdir().
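A minimal sketch of changing the working directory with os.chdir() follows. It assumes a throwaway temporary directory and a placeholder file named test_data.csv, both created only for this illustration, and it switches back to the original directory at the end.

```python
import os
import tempfile

# Remember where we started so we can return afterwards.
original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as data_dir:
    # Create a small placeholder CSV file in the temporary directory.
    with open(os.path.join(data_dir, "test_data.csv"), "w") as f:
        f.write("a,b\n1,2\n")

    # Make the data directory the current working directory.
    os.chdir(data_dir)
    files_here = os.listdir(os.getcwd())
    print(files_here)

    # Switch back before the temporary directory is cleaned up.
    os.chdir(original_dir)
```

Changing the working directory affects every later relative path in the same process, so restoring the original directory is usually a sensible habit.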

File Loading: Absolute and Relative Paths

When specifying file names to the read_csv function, you can supply either absolute or relative file paths.

  • A relative path is the path to the file if you start from your current working directory. In relative paths, typically the file will be in a subdirectory of the working directory and the path will not start with a drive specifier, e.g. (data/test_file.csv). The characters '..' are used to move to a parent directory in a relative path.
  • An absolute path is the complete path from the base of your file system to the file that you want to load, e.g. c:/Documents/Shane/data/test_file.csv. Absolute paths will start with a drive specifier (c:/ or d:/ in Windows, or '/' in Mac or Linux).

It's recommended and preferred to use relative paths where possible in applications, because absolute paths are unlikely to work on different computers due to different directory structures.

absolute vs relative file paths
Loading the same file with Pandas read_csv using relative and absolute paths. Relative paths are directions to the file starting at your current working directory, where absolute paths always start at the base of your file system.
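As a rough sketch of the two path styles, the snippet below writes a small file into a data subdirectory of a temporary folder, then loads it once via a relative path and once via an absolute path. The folder layout and file contents are assumptions made for illustration.

```python
import os
import tempfile
import pandas as pd

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as base_dir:
    # Build <base_dir>/data/test_file.csv with some placeholder rows.
    os.makedirs(os.path.join(base_dir, "data"))
    absolute_path = os.path.join(base_dir, "data", "test_file.csv")
    with open(absolute_path, "w") as f:
        f.write("x,y\n1,2\n3,4\n")

    # Relative paths are resolved from the current working directory.
    os.chdir(base_dir)
    relative_path = os.path.join("data", "test_file.csv")

    from_relative = pd.read_csv(relative_path)
    from_absolute = pd.read_csv(absolute_path)

    os.chdir(original_dir)

# Both paths point at the same file, so the DataFrames match.
print(from_relative.equals(from_absolute))
```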

4. Pandas CSV File Loading Errors

The most common errors you'll get while loading data from CSV files into Pandas will be:

  1. FileNotFoundError: File b'filename.csv' does not exist
    A File Not Found error is typically an issue with path setup, current directory, or file name confusion (file extension can play a part here!)
  2. UnicodeDecodeError: 'utf-8' codec can't decode byte in position : invalid continuation byte
    A Unicode Decode Error is typically caused by not specifying the encoding of the file, and happens when you have a file with non-standard characters. For a quick fix, try opening the file in Sublime Text, and re-saving with encoding 'UTF-8'.
  3. pandas.parser.CParserError: Error tokenizing data.
    Parse Errors can be caused in unusual circumstances to do with your data format – try to add the parameter "engine='python'" to the read_csv function call; this changes the data reading function internally to a slower but more stable method.
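The sketch below triggers the first two errors on purpose and shows one possible way to respond to each. The file names and contents are hypothetical, and passing encoding="latin-1" is an assumption that only holds because the sample file is written in that encoding here; a real file may need a different one.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # 1. A path that does not exist raises FileNotFoundError.
    missing_path = os.path.join(tmp_dir, "no_such_file.csv")
    try:
        pd.read_csv(missing_path)
        missing_error = None
    except FileNotFoundError as err:
        missing_error = err

    # 2. A file saved as Latin-1 fails to decode with the default UTF-8.
    #    "\xe9" and "\xe1" are accented characters (e-acute, a-acute).
    latin_path = os.path.join(tmp_dir, "latin1_data.csv")
    with open(latin_path, "w", encoding="latin-1") as f:
        f.write("name,city\nJos\xe9,M\xe1laga\n")
    try:
        pd.read_csv(latin_path)
        decode_error = None
    except UnicodeDecodeError as err:
        decode_error = err

    # Naming the encoding explicitly lets the file load.
    data = pd.read_csv(latin_path, encoding="latin-1")

    # 3. The slower Python parsing engine is selected the same way.
    data_python_engine = pd.read_csv(
        latin_path, encoding="latin-1", engine="python"
    )

print(type(missing_error).__name__, type(decode_error).__name__)
```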

Advanced Read CSV Files

There are some additional flexible parameters in the Pandas read_csv() function that are useful to have in your arsenal of data science techniques:

Specifying Data Types

As mentioned before, CSV files do not contain any type information for data. Data types are inferred through examination of the top rows of the file, which can lead to errors. To manually specify the data types for different columns, the dtype parameter can be used with a dictionary of column names and data types to be applied, for example: dtype={"name": str, "age": np.int32}.

Note that for dates and date times, the format, columns, and other behaviour can be adjusted using the parse_dates, date_parser, dayfirst, and keep_date_col parameters.
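A short sketch of dtype and parse_dates on an in-memory table is given here. The column names and values are invented for the example; the dtype dictionary mirrors the one mentioned above.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical data with a text column, an integer column, and a date column.
csv_text = "name,age,birth_date\nAlice,34,1990-05-17\nBob,28,1996-11-02\n"

data = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "age": np.int32},  # force specific column types
    parse_dates=["birth_date"],            # parse this column as datetimes
)

print(data.dtypes)
```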

Skipping and Picking Rows and Columns From File

The nrows parameter specifies how many rows from the top of the CSV file to read, which is useful to take a sample of a large file without loading it completely. Similarly, the skiprows parameter allows you to specify rows to leave out, either at the start of the file (provide an int), or throughout the file (provide a list of row indices). The usecols parameter can be used to specify which columns in the data to load.
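The three parameters can be tried on a small made-up table as sketched below. Note that a list passed to skiprows refers to 0-based line numbers in the file, so line 0 is the header row in this example.

```python
import io
import pandas as pd

# Hypothetical table: one header line followed by four data rows.
csv_text = "id,name,score\n1,Ann,90\n2,Bob,85\n3,Cy,70\n4,Dee,60\n"

# Read only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Skip file lines 1 and 2 (the first two data rows), keeping the header.
without_first_two = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 2])

# Load only the named columns.
two_columns = pd.read_csv(io.StringIO(csv_text), usecols=["name", "score"])

print(first_two["id"].tolist())
print(without_first_two["id"].tolist())
print(two_columns.columns.tolist())
```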

Custom Missing Value Symbols

When data is exported to CSV from different systems, missing values can be specified with different tokens. The na_values parameter allows you to customise the characters that are recognised as missing values. The default values interpreted as NA/NaN include: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
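Here is a minimal sketch of na_values, assuming a made-up export in which '.' and '??' both stand for a missing salary. The default token 'NA' is still recognised alongside the custom ones.

```python
import io
import pandas as pd

# Hypothetical data where missing salaries appear as '??', '.', or 'NA'.
csv_text = "name,salary\nAlice,50000\nBob,??\nCara,.\nDan,NA\n"

# Treat '.' and '??' as missing, in addition to the default tokens.
data = pd.read_csv(io.StringIO(csv_text), na_values=[".", "??"])

print(data["salary"].isna().tolist())
```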

    # Advanced CSV loading example
    data = pd.read_csv(
        "data/files/complex_data_example.tsv",      # relative python path to subdirectory
        sep='\t',                                   # Tab-separated value file.
        quotechar="'",                              # single quote allowed as quote character
        dtype={"salary": int},                      # Parse the salary column as an integer
        usecols=['name', 'birth_date', 'salary'],   # Only load the three columns specified.
        parse_dates=['birth_date'],                 # Interpret the birth_date column as a date
        skiprows=10,                                # Skip the first 10 rows of the file
        na_values=['.', '??']                       # Take any '.' or '??' values as NA
    )

 CSV Format Advantages and Disadvantages

As with all technical decisions, storing your data in CSV format has both advantages and disadvantages. Be aware of the potential pitfalls and issues that you will encounter as you load, store, and exchange data in CSV format:

On the plus side:

  • CSV format is universal and the data can be loaded by almost any software.
  • CSV files are simple to understand and debug with a basic text editor.
  • CSV files are quick to create and load into memory before analysis.

However, the CSV format has some negative sides:

  • There is no data type information stored in the text file; all typing (dates, int vs float, strings) is inferred from the data only.
  • There's no formatting or layout information storable – things like fonts, borders, and column width settings from Microsoft Excel will be lost.
  • File encodings can become a problem if there are non-ASCII compatible characters in text fields.
  • CSV format is inefficient; numbers are stored as characters rather than binary values, which is wasteful. You will find, however, that your CSV data compresses well using zip compression.
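The loss of type information can be seen in a quick round trip, sketched here with a made-up DataFrame: a datetime column is written out with to_csv, comes back as plain text by default, and is only restored as a date when parse_dates is supplied on reload.

```python
import io
import pandas as pd

# Hypothetical DataFrame with a proper datetime column.
original = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "joined": pd.to_datetime(["2020-01-15", "2021-06-30"]),
})

# Write to an in-memory CSV buffer (no index column) and rewind it.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Without hints, the dates are read back as text.
reloaded = pd.read_csv(buffer)

# With parse_dates, the column is converted back to datetimes.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["joined"])

print(original["joined"].dtype, reloaded["joined"].dtype, restored["joined"].dtype)
```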

As an aside, in an effort to counter some of these disadvantages, two prominent data science developers in the R and Python ecosystems, Wes McKinney and Hadley Wickham, recently introduced the Feather Format, which aims to be a fast, simple, open, flexible and multi-platform data format that supports multiple data types natively.

Additional Reading

  1. Official Pandas documentation for the read_csv function.
  2. Python 3 Notes on file paths, working directories, and using the os module.
  3. Datacamp Tutorial on loading CSV files, including some additional OS commands.
  4. PythonHow Loading CSV tutorial.

Source: https://www.shanelynn.ie/python-pandas-read-csv-load-data-from-csv-files/

Posted by: marksthicess.blogspot.com
