String handling in pandas

String methods are accessed on a Series through the .str accessor.

Example 1:

df_client["Subject"].str.rsplit("-", 1, expand=True)[1]

This splits the Subject column on the last occurrence of the "-" character, returns the pieces as separate columns (expand=True produces a DataFrame rather than a Series of lists), and then selects column 1, the part after the split.
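
A minimal sketch on made-up values, showing that expand=True yields a DataFrame:

import pandas as pd

s = pd.Series(["Ticket-123", "Case-A-456"])
parts = s.str.rsplit("-", n=1, expand=True)
# parts is a DataFrame with integer column labels:
#         0    1
# 0  Ticket  123
# 1  Case-A  456
parts[1]  # the part after the last "-": "123", "456"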

Example 2:

df_client["Subject"].str.strip()

This will remove any surrounding whitespace from the elements of the Subject column.

Methods for selecting the last element after a split

RPartition

df_egencia["Subject"].str.rpartition("-")[2].str.strip()

rpartition splits on the last occurrence of the separator and returns a DataFrame with three columns (similar to rsplit with expand=True): [0] the part before the separator, [1] the separator itself, and [2] the part after it. Element [2] is the split-off item.

RSplit with expand=True

df_egencia["Subject"].str.rsplit("-", 1, expand=True)[1]

expand=True returns a list.

RSplit with str index

df_client["Subject"].str.rsplit("-",1).str[-1]

Without expand, rsplit returns a Series whose elements are lists.
The .str accessor also supports positional indexing into those lists, so .str[-1] returns the last item of each list, as with normal list indexing.
See indexing-with-str
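
A minimal sketch on made-up values:

import pandas as pd

s = pd.Series(["Ticket-123", "Case-A-456"])
s.str.rsplit("-", n=1)          # Series of lists: [Ticket, 123], [Case-A, 456]
s.str.rsplit("-", n=1).str[-1]  # last item of each list: "123", "456"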

List comprehension

df_client['last_col'] = [x.rsplit("-", 1)[-1] for x in df_client["Subject"]]

Note: This will automatically add the list to the dataframe as a column.

Column handling

Rename

Renaming columns in pandas

Rename all columns

df.columns = ['a', 'b']

Rename one or more columns

df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
# Or rename the existing DataFrame (rather than creating a copy) 
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)

Filtering

Filtering more than one column or predicate

Boolean AND, OR and NOT can be used to combine conditions, using the following symbols (each condition must be wrapped in parentheses; see the example after the list):

& AND
| OR
~ NOT
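
A sketch with made-up column names:

df[(df['score'] >= 50) & (df['grade'] == 'A')]   # AND
df[(df['score'] >= 50) | (df['grade'] == 'A')]   # OR
df[~(df['grade'] == 'A')]                        # NOT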

Filtering using string functions

str.contains for partial matches
str.match for a match anchored at the start of the string (newer pandas also has str.fullmatch for a full-string match)

Examples:

df['col'].str.contains("Good ")
df['col'].str.match("Valid")
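
Both return a boolean Series, so they can be used directly as a filter (the column name here is made up):

df[df['Subject'].str.contains("Good")]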

Filtering a DataFrame by more than 1 column

Note: axis selects what apply passes to the function: each column (axis=0, the default) or each row (axis=1). To test several columns per row, use axis=1.

df[df.apply(lambda row: ('chrome' in row['filename']) and (row['my_time'] >= '2017-01-01'), axis=1)]
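
The same filter can usually be written with vectorized operations, which avoids the row-by-row apply and is faster on large frames:

df[df['filename'].str.contains('chrome') & (df['my_time'] >= '2017-01-01')]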

https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

d1[(d1['AMP'] == 'AMP6') & (d1['Portfolio'] == 'Water Networks')]

https://stackoverflow.com/questions/35186291/how-do-i-delete-rows-not-starting-with-x-in-pandas-or-keep-rows-starting-with

df = df[~df['c'].astype(str).str.startswith('1')]

where ~ means is not (negation)

https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p

df[~df.C.str.contains("XYZ")]  # idiomatic form of df[df.C.str.contains("XYZ") == False]

Dropna (Pandas)

https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan

https://stackoverflow.com/questions/36370839/better-way-to-drop-nan-rows-in-pandas

df.dropna() #drop all rows that have any NaN values

df.dropna(how='all') #drop only if ALL columns are NaN

df.dropna(thresh=2) #Drop row if it does not have at least two values that are not NaN

df.dropna(subset=[1]) #drop only if NaN in a specific column (here the column labelled 1)

For reference - pandas.notnull, and np.isfinite (numeric columns only)

df = df[pd.notnull(df['EPS'])]

df = df[np.isfinite(df['EPS'])]

Merge, Join, Concat (Pandas)

https://pandas.pydata.org/pandas-docs/stable/merging.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html

https://chrisalbon.com/python/pandas_join_merge_dataframe.html

https://stackoverflow.com/questions/41815079/pandas-merge-join-two-data-frames-on-multiple-columns

https://stackoverflow.com/questions/18792918/pandas-combining-2-data-frames-join-on-a-common-column
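
The links above cover the details; a minimal sketch with made-up frames:

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'val1': [1, 2]})
right = pd.DataFrame({'key': ['a', 'c'], 'val2': [3, 4]})

pd.merge(left, right, on='key')              # SQL-style inner join
pd.merge(left, right, on='key', how='left')  # left outer join
left.join(right.set_index('key'), on='key')  # join a column against the other frame's index
pd.concat([left, right])                     # stack frames vertically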

Read from Excel (Pandas)

https://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook

https://stackoverflow.com/questions/24709108/python-pandas-how-could-i-read-excel-file-without-column-label-and-then-insert

file_in = "PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx"

sheet_in = 'Project Details (2)'

names = ["PRN", "Projects", "Status", "AMP", "Portfolio", "Programme", "Type", "Delivery Route"]

df = pd.read_excel(file_in, sheetname=sheet_in, header=None, names=names, skiprows=73, parse_cols="B,C,G:L")

Old method

xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")

xls_file = pd.ExcelFile('PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx')
xls_file.sheet_names

https://stackoverflow.com/questions/17335649/how-to-specify-column-names-while-reading-an-excel-file-using-pandas

https://stackoverflow.com/questions/33655127/how-to-read-certain-columns-from-excel-using-pandas-python

import pandas as pd
import numpy as np

file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols="A,C:AA")
print(df)

parse_cols : int or list, default None
    If None then parse all columns.
    If int then indicates last column to be parsed.
    If list of ints then indicates list of column numbers to be parsed.
    If string then indicates comma-separated list of column names and column ranges (e.g. "A:E" or "A,C,E:F").

Limiting number of rows (Pandas)

https://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.head.html

https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

d1 = df.head(n=882)

(equivalent to SQL TOP)

https://stackoverflow.com/questions/23853553/python-pandas-how-to-read-only-first-n-rows-of-csv-files-in

If you only want to read the first 999,999 (non-header) rows:

read_csv(..., nrows=999999)

If you only want to read rows 1,000,000 ... 1,999,999

read_csv(..., skiprows=1000000, nrows=999999)

nrows : int, default None
    Number of rows of file to read. Useful for reading pieces of large files.

skiprows : list-like or integer
    Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.

and for large files, you'll probably also want to use chunksize:

chunksize : int, default None
    Return TextFileReader object for iteration.
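
A sketch of iterating in chunks (the file name is made up):

import pandas as pd

total = 0
for chunk in pd.read_csv('big_file.csv', chunksize=100000):
    total += len(chunk)  # each chunk is a DataFrame of up to 100,000 rows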

https://stackoverflow.com/questions/35747476/in-pandas-whats-the-equivalent-of-nrows-from-read-csv-to-be-used-in-read-ex

If you know the number of rows in your Excel sheet, you can use the skip_footer parameter (renamed skipfooter in newer pandas) to read the first n - skip_footer rows of your file, where n is the total number of rows.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Usage:

data = pd.read_excel(filepath, header=0, parse_cols = "A:D", skip_footer=80)

Assuming your Excel sheet has 100 rows, this line would parse the first 20 rows.

Extending @Erol's answer to be a bit more flexible, assuming we DON'T know the total number of rows in the Excel sheet:

xl = pd.ExcelFile(filepath)

# parsing first (index: 0) sheet
total_rows = xl.book.sheet_by_index(0).nrows

skiprows = 4
nrows = 20

# calc number of footer rows; the -1 accounts for the header row
skip_footer = total_rows - nrows - skiprows - 1

df = xl.parse(0, skiprows=skiprows, skip_footer=skip_footer,
              parse_cols="A:D").dropna(axis=1, how='all')

.dropna(axis=1, how='all') will drop all columns containing only NaN's

Export to CSV (Pandas)

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

d1.to_csv('PPM.csv')
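
By default to_csv also writes the index as the first column; pass index=False to omit it:

d1.to_csv('PPM.csv', index=False)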

Grouping column

https://stackoverflow.com/questions/44577367/python-dataframe-replace-not-recognizing-strings

filtered["Grouping"] = filtered["Name"].replace(list(self.map_group["To Group"]), list(self.map_group["Group To"]))

print("a name" in list(self.map_group["To Group"]))

https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

d1.groupby('Portfolio').size()

d1.groupby(['Portfolio', 'Type']).size()

https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p

d1[d1.Type.str.contains('OPEX') == False].groupby(['Portfolio', 'Type']).size()

Float precision IsNull mean

https://stackoverflow.com/questions/44912518/python-pandas-dataframe-converted-to-floats

round(average_petal_width, 2)

pd.set_option('display.precision', 2)

average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()

iris_data.loc[(iris_data['class'] == 'Iris-setosa') & (iris_data['petal_width_cm'].isnull()), 'petal_width_cm'] = average_petal_width

How to make good reproducible pandas examples

https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples

Joining dataframe ranges

https://stackoverflow.com/questions/44551864/joining-pandas-dataframe-on-hour-range

I have two Pandas dataframes, one with weather information, given hourly, and one with bus arrival data, given approximately every minute or so.

I want to join the dataframes so bus data is associated with weather for that hour e.g. bus data between 1pm and 2pm on 01/06/2012 will be associated with corresponding weather data.

pd.merge_asof(bus_df,weather_df, on='DateTime')
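
Note: merge_asof requires both frames to be sorted on the key, and by default matches each left-hand row to the most recent earlier right-hand row. A sketch under that assumption:

bus_df = bus_df.sort_values('DateTime')
weather_df = weather_df.sort_values('DateTime')
merged = pd.merge_asof(bus_df, weather_df, on='DateTime')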

Running Total

https://stackoverflow.com/questions/45122444/pandas-row-addition

You can use cumsum and then add the base number. (df[::-1].Total - df[::-1].Amount.cumsum()).max() gives the total prior to the last row; that base can then be added to the cumsum results. Also, as the data is added from the bottom up, the rows must first be reversed using df[::-1].

df['Total_New']=df[::-1].Amount.cumsum()+(df[::-1].Total-df[::-1].Amount.cumsum()).max()

Drop_Row_with_conditions

From http://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression

To directly answer the question's title (which may not be the OP's exact problem, but could help other users coming across this question), one way to do this is to use the drop method:

df = df.drop(some_labels)

df = df.drop(df[<some boolean condition>].index)

Example

To remove all rows where column 'score' is < 50:

df = df.drop(df[df.score < 50].index)

In-place version:

df.drop(df[df.score < 50].index, inplace=True)

Multiple conditions

(see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

To remove all rows where column 'score' is < 50 and > 20:

df = df.drop(df[(df.score < 50) & (df.score > 20)].index)

Create_Dataframe_From_List_of_Dicts

pd.DataFrame(d)

Example:

import glob2, os
import pandas as pd

def process(file_name):
    # build one dict of metadata per file
    pp = {}
    pp['filename'] = file_name
    pp['last_modified'] = os.path.getmtime(file_name)
    return pp

listing = [process(x) for x in glob2.iglob('*.md')]
df = pd.DataFrame(listing)