Filtering

Filtering a DataFrame by more than 1 column

Use axis=1 so apply evaluates the lambda once per row:

df[df.apply(lambda row: ('chrome' in row['filename'] and row['my_time'] >= '2017-01-01'), axis=1)]
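A plain boolean mask is usually faster than apply for this kind of row filter. A vectorized sketch, assuming filename is a string column and my_time holds sortable date strings or datetimes:

df[df['filename'].str.contains('chrome') & (df['my_time'] >= '2017-01-01')]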

https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

d1[(d1['AMP'] == 'AMP6') & (d1['Portfolio'] == 'Water Networks')]

https://stackoverflow.com/questions/35186291/how-do-i-delete-rows-not-starting-with-x-in-pandas-or-keep-rows-starting-with

df = df[~df['c'].astype(str).str.startswith('1')]

where ~ negates the boolean mask ("is not")

https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p

df[df.C.str.contains("XYZ") == False]
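Comparing against False works, but ~ is the more idiomatic negation; na=False (an assumption that rows with missing C should be dropped too) keeps NaN values from breaking the mask:

df[~df.C.str.contains("XYZ", na=False)]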

Dropna (Pandas)

https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan

https://stackoverflow.com/questions/36370839/better-way-to-drop-nan-rows-in-pandas

df.dropna()           # drop all rows that have any NaN values

df.dropna(how='all')  # drop only if ALL columns are NaN

df.dropna(thresh=2)   # drop row if it does not have at least two non-NaN values

df.dropna(subset=[1]) # drop only if NaN in a specific column (as asked in the question)
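A minimal sketch with made-up values showing how the variants differ:

import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, np.nan, np.nan], 1: [2.0, np.nan, 3.0]})
df.dropna()           # row 0 only (the only row with no NaNs)
df.dropna(how='all')  # drops row 1 (all NaN), keeps rows 0 and 2
df.dropna(thresh=2)   # row 0 only (needs two non-NaN values)
df.dropna(subset=[1]) # rows 0 and 2 (column 1 is not NaN there)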

For reference: pandas.notnull and np.isfinite

df = df[pd.notnull(df['EPS'])]

df = df[np.isfinite(df['EPS'])]

Merge, Join, Concat (Pandas)

https://pandas.pydata.org/pandas-docs/stable/merging.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html

https://chrisalbon.com/python/pandas_join_merge_dataframe.html

https://stackoverflow.com/questions/41815079/pandas-merge-join-two-data-frames-on-multiple-columns

https://stackoverflow.com/questions/18792918/pandas-combining-2-data-frames-join-on-a-common-column
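A minimal sketch of the three operations, with made-up frames (the column names are assumptions loosely based on the data above):

import pandas as pd

left = pd.DataFrame({'PRN': [1, 2], 'AMP': ['AMP6', 'AMP6'], 'Cost': [10, 20]})
right = pd.DataFrame({'PRN': [1, 2], 'AMP': ['AMP6', 'AMP5'], 'Status': ['Live', 'Closed']})

merged = pd.merge(left, right, on=['PRN', 'AMP'])  # SQL-style join on multiple columns
joined = left.set_index('PRN').join(right.set_index('PRN'), lsuffix='_l', rsuffix='_r')  # join on index
stacked = pd.concat([left, right])  # stack rows, aligning columns by name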

Read from Excel (Pandas)

https://stackoverflow.com/questions/26521266/using-pandas-to-pd-read-excel-for-multiple-worksheets-of-the-same-workbook

https://stackoverflow.com/questions/24709108/python-pandas-how-could-i-read-excel-file-without-column-label-and-then-insert

df = pd.read_excel(file_in, sheetname=sheet_in, header=None, names=names, skiprows=73, parse_cols="B,C,G:L")

with the inputs:

names = ["PRN", "Projects", "Status", "AMP", "Portfolio", "Programme", "Type", "Delivery Route"]

sheet_in = 'Project Details (2)'

file_in = "PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx"

(In newer pandas, sheetname became sheet_name and parse_cols became usecols.)

Old method

xl = pd.ExcelFile("Path + filename")
df = xl.parse("Sheet1")

xls_file = pd.ExcelFile('PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx')
xls_file.sheet_names

https://stackoverflow.com/questions/17335649/how-to-specify-column-names-while-reading-an-excel-file-using-pandas

https://stackoverflow.com/questions/33655127/how-to-read-certain-columns-from-excel-using-pandas-python

import pandas as pd
import numpy as np

file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols="A,C:AA")
print(df)

parse_cols : int or list, default None
If None, parse all columns. If int, indicates the last column to be parsed. If a list of ints, indicates the list of column numbers to be parsed. If a string, indicates a comma-separated list of column names and column ranges (e.g. "A:E" or "A,C,E:F").

Limiting number of rows (Pandas)

https://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.head.html

https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

d1 = df.head(n=882)

(equivalent to SQL TOP)

https://stackoverflow.com/questions/23853553/python-pandas-how-to-read-only-first-n-rows-of-csv-files-in

If you only want to read the first 999,999 (non-header) rows:

read_csv(…, nrows=999999)

If you only want to read rows 1,000,000 … 1,999,999

read_csv(…, skiprows=1000000, nrows=999999)

nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files.

skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.

and for large files, you'll probably also want to use chunksize:

chunksize : int, default None
Return TextFileReader object for iteration.
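A sketch of chunked reading; the file name, chunk size, and filter column are placeholders:

reader = pd.read_csv('big_file.csv', chunksize=100000)  # TextFileReader, not a DataFrame
pieces = [chunk[chunk['my_time'] >= '2017-01-01'] for chunk in reader]  # process chunk by chunk
df = pd.concat(pieces, ignore_index=True)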

https://stackoverflow.com/questions/35747476/in-pandas-whats-the-equivalent-of-nrows-from-read-csv-to-be-used-in-read-ex

If you know the number of rows in your Excel sheet, you can use the skip_footer parameter to read the first n - skip_footer rows of your file, where n is the total number of rows.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Usage:

data = pd.read_excel(filepath, header=0, parse_cols="A:D", skip_footer=80)

Assuming your Excel sheet has 100 rows, this line would parse the first 20 rows.

I'd like to extend @Erol's answer to be a bit more flexible.

Assuming that we DON'T know the total number of rows in the Excel sheet:

xl = pd.ExcelFile(filepath)

# parse the first (index: 0) sheet
total_rows = xl.book.sheet_by_index(0).nrows

skiprows = 4
nrows = 20

# calc number of footer rows
# (-1) for the header row
skip_footer = total_rows - nrows - skiprows - 1

df = xl.parse(0, skiprows=skiprows, skip_footer=skip_footer, parse_cols="A:D") \
       .dropna(axis=1, how='all')

.dropna(axis=1, how='all') will drop all columns containing only NaNs

Export to CSV (Pandas)

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

d1.to_csv('PPM.csv')
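By default to_csv also writes the index as an unnamed first column; pass index=False to leave it out:

d1.to_csv('PPM.csv', index=False)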


Grouping column

https://stackoverflow.com/questions/44577367/python-dataframe-replace-not-recognizing-strings

filtered["Grouping"] = filtered["Name"].replace(list(self.map_group["To Group"]), list(self.map_group["Group To"]))

print("a name" in list(self.map_group["To Group"]))

https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

d1.groupby('Portfolio').size()

d1.groupby(['Portfolio', 'Type']).size()

https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p

d1[d1.Type.str.contains('OPEX') == False].groupby(['Portfolio', 'Type']).size()
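size() returns a Series keyed by the group labels; reset_index turns it back into a DataFrame with a named count column:

d1.groupby(['Portfolio', 'Type']).size().reset_index(name='count')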

Float precision, isnull, mean

https://stackoverflow.com/questions/44912518/python-pandas-dataframe-converted-to-floats

pd.set_option('precision', 2)

average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()

round(average_petal_width, 2)

iris_data.loc[(iris_data['class'] == 'Iris-setosa') & (iris_data['petal_width_cm'].isnull()), 'petal_width_cm'] = average_petal_width

How to make good reproducible pandas examples

https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples

Joining dataframe ranges

https://stackoverflow.com/questions/44551864/joining-pandas-dataframe-on-hour-range

I have two Pandas dataframes, one with weather information, given hourly, and one with bus arrival data, given approximately every minute or so.

I want to join the dataframes so bus data is associated with weather for that hour e.g. bus data between 1pm and 2pm on 01/06/2012 will be associated with corresponding weather data.

pd.merge_asof(bus_df, weather_df, on='DateTime')
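merge_asof matches each bus row to the most recent weather row at or before it, and requires both frames to be sorted on the key. A sketch with made-up frames:

import pandas as pd

weather_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2012-06-01 13:00', '2012-06-01 14:00']),
    'Temp': [18.0, 19.5],
})
bus_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2012-06-01 13:05', '2012-06-01 13:57']),
    'Arrivals': [3, 5],
})

# both bus rows match the 13:00 weather reading
joined = pd.merge_asof(bus_df.sort_values('DateTime'), weather_df.sort_values('DateTime'), on='DateTime')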

Running Total

https://stackoverflow.com/questions/45122444/pandas-row-addition

You can use cumsum and then add the base number. (df[::-1].Total - df[::-1].Amount.cumsum()).max() gives you the total prior to the last row; you can then add that base to the cumsum results. Also, as your data is added from the bottom up, you need to reverse the rows with df[::-1] first.

df['Total_New'] = df[::-1].Amount.cumsum() + (df[::-1].Total - df[::-1].Amount.cumsum()).max()
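A worked sketch with made-up numbers, assuming rows are stored newest-first and only the oldest row carries a reliable Total:

df = pd.DataFrame({'Amount': [5, 20, 10], 'Total': [0, 0, 110]})  # newest row first
base = (df[::-1].Total - df[::-1].Amount.cumsum()).max()  # balance before the oldest row: 100
df['Total_New'] = df[::-1].Amount.cumsum() + base  # Total_New = [135, 130, 110]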

Drop rows with conditions

From http://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression

To directly answer this question’s title (which I understand is not necessarily the OP’s problem but could help other users coming across this question) one way to do this is to use the drop method:

df = df.drop(some labels)

df = df.drop(df[<some boolean condition>].index)

Example

To remove all rows where column ‘score’ is < 50:

df = df.drop(df[df.score < 50].index)

In place version (as pointed out in comments)

df.drop(df[df.score < 50].index, inplace=True)

Multiple conditions

(see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

To remove all rows where column ‘score’ is < 50 and > 20

df = df.drop(df[(df.score < 50) & (df.score > 20)].index)
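Dropping by index is equivalent to keeping the complement with a boolean mask, which is often simpler (assuming score has no NaNs; drop would keep NaN rows, the mask would not):

df = df[df.score >= 50]

df = df[~((df.score < 50) & (df.score > 20))]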

Create DataFrame from list of dicts

pd.DataFrame(d)

Example:

import glob2, os
import pandas as pd

def process(file_name):
    pp = {}  # each record is a dict, not a list
    pp['filename'] = file_name
    pp['last_modified'] = os.path.getmtime(file_name)
    return pp

listing = [process(x) for x in glob2.iglob('*.md')]
df = pd.DataFrame(listing)