Filtering a DataFrame by more than 1 column

Use apply with axis=1 to evaluate a condition per row:

df[df.apply(lambda row: ('chrome' in row['filename'] and row['my_time'] >= '2017-01-01'), axis=1)]

d1[(d1['AMP'] == 'AMP6') & (d1['Portfolio'] == 'Water Networks')]

df = df[~df['c'].astype(str).str.startswith('1')]

where ~ is the negation (NOT) operator

df[df.C.str.contains("XYZ") == False]
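The `== False` comparison works, but `~` is the more idiomatic negation; a minimal sketch with a made-up frame:

```python
import pandas as pd

# Toy frame (illustrative data only).
df = pd.DataFrame({"C": ["XYZ-1", "abc", "XYZ-2", "def"]})

# Equivalent to df[df.C.str.contains("XYZ") == False]
out = df[~df["C"].str.contains("XYZ")]
print(list(out["C"]))  # ['abc', 'def']
```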

Dropna (Pandas)

df.dropna() #drop all rows that have any NaN values

df.dropna(how='all') #drop only if ALL columns are NaN

df.dropna(thresh=2) #Drop row if it does not have at least two values that are not NaN

df.dropna(subset=[1]) #Drop only if NaN in a specific column (here the column labelled 1)

For reference - pandas.notnull , np.isfinite

df = df[pd.notnull(df['EPS'])]

df = df[np.isfinite(df['EPS'])]

Merge, Join, Concat (Pandas)
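No notes here yet; a minimal sketch of the three operations on made-up frames (`left` and `right` are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# merge: SQL-style join on a column
inner = pd.merge(left, right, on="key")               # keys present in both -> 1 row
outer = pd.merge(left, right, on="key", how="outer")  # union of keys -> 3 rows

# concat: stack frames on top of each other
stacked = pd.concat([left, left], ignore_index=True)  # 4 rows

print(len(inner), len(outer), len(stacked))  # 1 3 4
```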

Read from Excel (Pandas)

df = pd.read_excel(file_in, sheetname=sheet_in, header=None, names=names, skiprows=73, parse_cols="B,C,G:L")
# note: newer pandas renamed these arguments - sheet_name= replaces sheetname=, usecols= replaces parse_cols=

names = ["PRN", "Projects", "Status", "AMP", "Portfolio", "Programme", "Type", "Delivery Route"]

sheet_in = 'Project Details (2)'

file_in = "PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx"

Old method

xl = pd.ExcelFile("path + filename")
df = xl.parse("Sheet1")

xls_file = pd.ExcelFile('PPM-Portfolio Performance Monitoring_dashboard New 2017_07_20.xlsx')
xls_file.sheet_names

import pandas as pd
import numpy as np

file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols="A,C:AA")
print(df)

parse_cols : int or list, default None. If None, parse all columns; if int, the last column to be parsed; if a list of ints, the column numbers to be parsed; if a string, a comma-separated list of column names and column ranges (e.g. "A:E" or "A,C,E:F").

Limiting number of rows (Pandas)

d1 = df.head(n=882)

(equivalent to SQL TOP)

If you only want to read the first 999,999 (non-header) rows:

read_csv(…, nrows=999999)

If you only want to read rows 1,000,000 … 1,999,999

read_csv(…, skiprows=1000000, nrows=999999)

nrows : int, default None. Number of rows of file to read. Useful for reading pieces of large files.

skiprows : list-like or integer. Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.

and for large files, you’ll probably also want to use chunksize:

chunksize : int, default None. Return a TextFileReader object for iteration.
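A sketch of iterating over chunks, using an in-memory CSV to stand in for a large file:

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# With chunksize set, read_csv yields DataFrames of up to 2 rows each
# instead of loading the whole file at once.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["a"].sum()
print(total)  # 16
```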

If you know the number of rows in your Excel sheet, you can use the skip_footer parameter to read the first n - skip_footer rows of your file, where n is the total number of rows.


data = pd.read_excel(filepath, header=0, parse_cols="A:D", skip_footer=80)

Assuming your Excel sheet has 100 rows, this line would parse the first 20 rows.

I'd like to make (extend) @Erol's answer a bit more flexible.

Assuming that we DON’T know the total number of rows in the excel sheet:

xl = pd.ExcelFile(filepath)

# parse the first (index: 0) sheet; total row count comes from the underlying xlrd book
total_rows = xl.book.sheet_by_index(0).nrows

skiprows = 4
nrows = 20

# calc the number of footer rows; the extra -1 accounts for the header row
skip_footer = total_rows - nrows - skiprows - 1

df = xl.parse(0, skiprows=skiprows, skip_footer=skip_footer, parse_cols="A:D") \
       .dropna(axis=1, how='all')

.dropna(axis=1, how='all') will drop all columns containing only NaNs

Export to CSV (Pandas)
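A minimal sketch (the frame and the file name `projects_out.csv` are made up):

```python
import pandas as pd

df = pd.DataFrame({"PRN": [1, 2], "Status": ["Open", "Closed"]})

# index=False omits the row index column from the output file
df.to_csv("projects_out.csv", index=False)
```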



Grouping column

filtered["Grouping"] = filtered["Name"].replace(list(self.map_group["To Group"]), list(self.map_group["Group To"]))

print("a name" in list(self.map_group["To Group"]))


d1.groupby(['Portfolio', 'Type']).size()

d1[d1.Type.str.contains('OPEX') == False].groupby(['Portfolio', 'Type']).size()

Float precision, IsNull, mean

round(average_petal_width, 2)

pd.set_option('precision', 2)

average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()

iris_data.loc[(iris_data['class'] == 'Iris-setosa') & (iris_data['petal_width_cm'].isnull()), 'petal_width_cm'] = average_petal_width

How to make good reproducible pandas examples

Joining dataframe ranges

I have two Pandas dataframes, one with weather information, given hourly, and one with bus arrival data, given approximately every minute or so.

I want to join the dataframes so bus data is associated with weather for that hour e.g. bus data between 1pm and 2pm on 01/06/2012 will be associated with corresponding weather data.

pd.merge_asof(bus_df, weather_df, on='DateTime')
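A sketch with made-up weather and bus frames showing the backward-looking match (each bus row picks up the most recent weather reading at or before its timestamp):

```python
import pandas as pd

weather_df = pd.DataFrame({
    "DateTime": pd.to_datetime(["2012-06-01 13:00", "2012-06-01 14:00"]),
    "TempC": [18.0, 19.5],
})
bus_df = pd.DataFrame({
    "DateTime": pd.to_datetime(["2012-06-01 13:05", "2012-06-01 13:59",
                                "2012-06-01 14:20"]),
    "Route": ["12", "12", "46"],
})

# Both frames must be sorted on the key column.
joined = pd.merge_asof(bus_df, weather_df, on="DateTime")
print(list(joined["TempC"]))  # [18.0, 18.0, 19.5]
```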

Running Total

You can use cumsum and then add the base number. (df[::-1].Total - df[::-1].Amount.cumsum()).max() gives you the total prior to the last row; you can then add that base to the cumsum results. Also, as your data is added from bottom up, you need to reverse the rows with df[::-1] first.
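A sketch of the reversed cumsum on a made-up `Amount` column (the base value 100 is illustrative):

```python
import pandas as pd

# Hypothetical ledger with the newest entry first, i.e. data is
# appended "from bottom up": the last row is the oldest.
df = pd.DataFrame({"Amount": [5, 10, 20]})
base = 100  # balance before the oldest entry

# Reverse into chronological order, accumulate, then index alignment
# puts each running total back on its original row.
df["Total"] = (base + df[::-1]["Amount"].cumsum())[::-1]
print(list(df["Total"]))  # [135, 130, 120]
```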




To directly answer this question’s title (which I understand is not necessarily the OP’s problem but could help other users coming across this question) one way to do this is to use the drop method:

df = df.drop(some labels)

df = df.drop(df[<some boolean condition>].index)


To remove all rows where column ‘score’ is < 50:

df = df.drop(df[df.score < 50].index)

In place version (as pointed out in comments)

df.drop(df[df.score < 50].index, inplace=True)

Multiple conditions

(see Boolean Indexing)

The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

To remove all rows where column ‘score’ is < 50 and > 20

df = df.drop(df[(df.score < 50) & (df.score > 20)].index)




import glob2, os
import pandas as pd

def process(file_name):
    pp = {}  # must be a dict, not a list, to assign by key
    pp['filename'] = file_name
    pp['last_modified'] = os.path.getmtime(file_name)
    return pp

listing = [process(x) for x in glob2.iglob('*.md')]
df = pd.DataFrame(listing)